Inferential statistics in a nutshell — With Python

Sanchit Tiwari
11 min readJan 9, 2020

Introduction

As a research scholar, I need to use inferential statistics in my research work to make inferences about a population from sample data. In this post, I want to share the details of inferential statistics, specifically how different statistical tests allow us to make inferences in Python.

Defining a good research question is essential for making a sound inference, so before we talk about inferential statistics it is important to understand how to frame that question. There are mainly three questions to answer while formulating a research question:

What is the target population of interest?

Has the question been asked before? And will the new study add knowledge that didn’t exist before?

Are the variables readily available, measured appropriately, or feasible to measure using well-established tools?

We want to start applying statistical procedures to data and for that, we have to have a good idea of what question we’re answering in the first place.

Before we go into the details of inferential statistics it is important to understand descriptive statistics. Descriptive statistics is a step in exploratory data analysis that summarizes and represents the data in a way that lets patterns emerge from it. It is a simple way to describe data, but it does not help us draw conclusions about the hypotheses we have made: descriptive statistics identify patterns in the data, but they do not allow us to generalize beyond the data. Inferential statistics, on the other hand, lets us make inferences (or test hypotheses) about a population from a sample. Say, for example, we have collected the weight of 100 people living in India. The mean of their weights is a descriptive statistic, but it does not by itself tell us the average weight of the whole of India. Inferential statistics helps us determine what the average weight of the whole of India would likely be, and we will go into the details of this in this post. Inferential statistics is all about describing the larger picture of the analysis with a limited set of data and deriving conclusions from it.

In this post we will cover the following elements of inferential statistics:

  • Probability Distributions
  • Making inference(or hypotheses) from sample data
  • Hypothesis Testing
  • Type 1 and Type 2 errors
  • Confidence interval

There are two major types of random variables: discrete and continuous. A random variable is a characteristic, measurement, or count that changes randomly according to some set of probabilities, and a list of all possible values of a random variable, along with their probabilities, is called a probability distribution. Discrete random variables count things (the number of heads in 10 coin flips, the number of female Democrats in a sample, and so on). One of the most well-known discrete probability distributions is the binomial. Binomial means "two names" and is associated with situations involving two outcomes: success or failure.

A random variable has a binomial distribution if all of the following conditions are met:

1. There are a fixed number of trials.

2. Each trial has two possible outcomes: success or failure.

3. The probability of success is the same for each trial.

4. The trials are independent, meaning the outcome of one trial doesn’t influence that of any other.

Example: the coin-flipping scenario meets all four conditions, so the random variable X, which counts the number of successes (heads) that occur in 10 trials, has a binomial distribution with n = 10 and p = 1/2.

Consider a single experiment with two possible outcomes: success or failure. Success has a probability of p, and failure has a probability of 1 - p. A random variable that takes the value 1 in case of success and 0 in case of failure is said to have a Bernoulli distribution.
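As a quick sketch with SciPy's bernoulli object (the value of p here is just an assumed example):

from scipy.stats import bernoulli

p = 0.5  # assumed probability of success, for illustration
rv = bernoulli(p)

# P(X = 1) = p and P(X = 0) = 1 - p
print(rv.pmf(1))  # 0.5
print(rv.pmf(0))  # 0.5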

The SciPy package for Python provides useful functions to perform statistical computations. You can install it from http://www.scipy.org/. The following code plots the binomial distribution:

from scipy.stats import binom
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1)
x = [0, 1, 2, 3, 4, 5, 6]
n, p = 6, 0.5
bi = binom(n, p)

# Draw a vertical line at each value of x with height equal to its probability
ax.vlines(x, 0, bi.pmf(x), colors='k', linestyles='-', lw=1, label='Probability')
ax.legend(loc='best', frameon=False)
plt.show()

Below is the output graph:

Figure 1: PMF of the binomial distribution with n = 6 and p = 0.5.

A continuous random variable measures things and takes on values within an interval, or has so many possible values that it might as well be deemed continuous (for example, time to complete a task, exam scores, and so on). The most famous continuous random variable is the normal, and listing all possible values of a normal random variable along with their probabilities gives us the normal distribution.

We say that X has a normal distribution if its values fall along a smooth (continuous) curve with a bell-shaped, symmetric pattern, meaning it looks the same on each side when cut down the middle. The total area under the curve is 1. Each normal distribution has its own mean and its own standard deviation.

In the binomial distribution example above, if we flip the coin 100 times instead of 6, we will start seeing a normal shape emerge, as sketched below.
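This is a minimal sketch, assuming a fair coin (p = 0.5) and an x-range chosen around the mean, that overlays the binomial PMF for n = 100 on a normal curve with the same mean n*p and standard deviation sqrt(n*p*(1-p)):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom, norm

n, p = 100, 0.5
x = np.arange(30, 71)  # values around the mean of 50

# Binomial PMF for 100 coin flips
plt.vlines(x, 0, binom.pmf(x, n, p), colors='k', lw=1, label='Binomial PMF')

# Normal curve with the same mean and standard deviation
mu, sigma = n * p, np.sqrt(n * p * (1 - p))
plt.plot(x, norm.pdf(x, mu, sigma), 'r--', label='Normal approximation')
plt.legend(loc='best', frameon=False)
plt.show()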

One very special member of the normal distribution family is called the standard normal distribution, or Z-distribution. The standard normal (Z) distribution has a mean of zero and a standard deviation of 1. A z-score, in simple terms, expresses how many standard deviations a value lies from the mean. The general formula for the z-score is z = (x - μ)/σ: you take your x-value, subtract the mean, and divide by the standard deviation; this gives you its corresponding z-score.

To calculate z-scores we import a hypothetical data set containing the height and weight of 100 random people.

import pandas as pd

sample = pd.read_excel("C:/Data/Python/sample.xlsx")
sample

  Person  Weight  Height
0     P1      71     180
1     P2      60     185
2     P3      74     183
3     P4      52     185
4     P5      56     180
5     P6      54     165
...

Using the given formula we can calculate the z-scores as follows:

sample['wt_ZScore'] = (sample['Weight'] - sample['Weight'].mean()) / sample['Weight'].std(ddof=0)
sample['ht_ZScore'] = (sample['Height'] - sample['Height'].mean()) / sample['Height'].std(ddof=0)

  Person  Weight  Height  wt_ZScore  ht_ZScore
0     P1      71     180   0.705215   1.149901
1     P2      60     185  -0.506874   1.706485
2     P3      74     183   1.035785   1.483851
3     P4      52     185  -1.388393   1.706485
4     P5      56     180  -0.947633   1.149901
...

In Python, we can also use stats.zscore to get the z-scores (note that it requires importing stats from SciPy):

from scipy import stats

stats.zscore(sample['Weight'])

0.70521543, -0.50687359, 1.03578516, -1.38839287, -0.94763323, ...

So, a person with a weight of 74 has a z-score of 1.035785.

A Poisson distribution is the probability distribution of the number of independent event occurrences in a fixed interval. A binomial distribution is used to determine the probability of binary outcomes, whereas a Poisson distribution is used for count-based distributions. For the Poisson we can use the poisson function from SciPy:

from scipy.stats import poisson

# Poisson distribution with a rate (mean) of 20 events per interval
rv = poisson(20)

# Probability of observing exactly 23 events
rv.pmf(23)

The output is as follows:

0.06688...

There are various kinds of probability distributions, and each distribution gives the probability of each event associated with a random experiment's outcome. There are several ways of specifying a probability distribution:

Probability mass function (PMF, for discrete random variables)

Probability density function (PDF, for continuous random variables)

Cumulative distribution function (CDF)

Let us start with the cumulative distribution function (CDF), a function FX : R -> [0, 1] which specifies a probability measure as FX(x) = P(X <= x). By using this function one can calculate the probability of any event; a sample CDF is plotted below.
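As a minimal sketch, the following plots the CDF of the standard normal distribution, which rises smoothly from 0 to 1:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-4, 4, 200)

# F(x) = P(X <= x) for the standard normal
plt.plot(x, norm.cdf(x), label='Standard normal CDF')
plt.legend(loc='best', frameon=False)
plt.show()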

CDF properties: 0 <= FX(x) <= 1 for every x; FX is non-decreasing; and FX(x) approaches 0 as x approaches negative infinity and approaches 1 as x approaches positive infinity.

When a random variable X takes on a finite set of possible values (i.e., X is a discrete random variable), a simpler way to represent the probability measure associated with a random variable is to directly specify the probability of each value that the random variable can assume. In particular, a probability mass function (PMF) is a function pX : Ω -> R such that pX(x) = P(X = x).

PMF properties: 0 <= pX(x) <= 1 for every x; the values of pX(x) sum to 1 over all possible values of X; and P(X in A) is the sum of pX(x) over the x in A.

For some continuous random variables, the cumulative distribution function FX(x) is differentiable everywhere. In these cases, we define the probability density function, or PDF, as the derivative of the CDF. Note that the PDF for a continuous random variable may not always exist (i.e., if FX(x) is not differentiable everywhere). Both CDFs and PDFs (when they exist!) can be used for calculating the probabilities of different events. But it should be emphasized that the value of the PDF at a given point x is not the probability of that event. For example, fX(x) can take on values larger than one (but the integral of fX(x) over any subset of R will be at most one).

PDF properties: fX(x) >= 0 for all x; fX integrates to 1 over the whole real line; and P(a <= X <= b) equals the integral of fX(x) from a to b.
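To make the point about densities concrete, here is a small example (the standard deviation of 0.1 is an assumed value) showing a density larger than one while interval probabilities stay at most one:

from scipy.stats import norm

rv = norm(0, 0.1)  # normal with mean 0 and a small standard deviation

# The density at the mean is about 3.99, larger than 1,
# yet probabilities of intervals stay at or below 1
print(rv.pdf(0))                    # ~3.9894
print(rv.cdf(0.3) - rv.cdf(-0.3))   # ~0.9973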

A hypothesis test is a statistical procedure that's designed to test a claim. Typically, the claim is being made about a population parameter (one number that characterizes the entire population). Because parameters tend to be unknown quantities, everyone wants to make claims about what their values may be. Every hypothesis test contains two hypotheses.

The first hypothesis is called the null hypothesis, denoted Ho. The null hypothesis always states that the population parameter is equal to the claimed value. For example, if the claim is that the average time to make a name-brand ready-mix pie is five minutes, the statistical shorthand notation for the null hypothesis in this case would be: Ho: μ = 5. But if the null hypothesis is found not to be true, what is your alternative going to be? That is your second, or alternative, hypothesis, denoted Ha, which is accepted or rejected based on your hypothesis test. In this case the alternative hypothesis is that the population parameter is not equal to the claimed value: Ha: μ ≠ 5.

In general, when hypothesis testing, you set up Ho and Ha so that you believe Ho is true unless your evidence (your data and statistics) show you otherwise. And in that case, where you have sufficient evidence against Ho, you reject Ho in favor of Ha. The burden of proof is on the researcher to show sufficient evidence against Ho before it’s rejected. If Ho is rejected in favor of Ha, the researcher can say he or she has found a statistically significant result; that is, the results refute the previous claim, and something different or new is happening.

To test whether the claim is true, you look at the test statistic computed from your sample and see whether it supports the claim. And how do you determine that? With the p-value: the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one you got. The p-value is a measure of the strength of the evidence against the null hypothesis. If the p-value is equal to or less than the alpha level (α), the data are inconsistent with the null hypothesis and it is rejected. To make a proper decision about whether or not to reject Ho, you determine your cutoff probability for the p-value before doing the hypothesis test; this cutoff is called the alpha level (α). Typical values for α are 0.05 or 0.01.

In a nutshell, here are the steps for hypothesis testing:

1. Set up the null and alternative hypotheses: Ho and Ha.

2. Take a random sample of individuals from the population and calculate the sample statistics (means and standard deviations).

3. Convert the sample statistic to a test statistic by changing it to a standard score.

4. Find the p-value for your test statistic.

5. Examine your p-value and make your decision.

Let's define the alpha level at 5% and test whether a weight of 70 kg is common in our population. If the p-value is less than 5%, the null hypothesis is rejected and a weight of 70 kg is not common. We will use our hypothetical dataset.

Let's get the z-score for a weight of 70 kg:

zscore = (70 - sample['Weight'].mean()) / sample['Weight'].std()
zscore

Out[62]: 0.592042914843196

We can use the cdf function in Python to get the p-value:

prob = 1 - stats.norm.cdf(zscore)
prob

Out[64]: 0.2769109257514748

Since the p-value (about 0.277) is greater than our alpha level of 0.05, we fail to reject the null hypothesis: a weight of 70 kg is not unusual in this sample. Other tests such as the t-test, chi-square, or ANOVA can also be used to test whether a hypothesis about the mean holds, depending on the distribution of the data.
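For instance, a one-sample t-test of the same 70 kg claim can be run with scipy.stats.ttest_1samp; this is a minimal sketch, again reusing the hypothetical sample DataFrame loaded earlier:

from scipy import stats

# Ho: the population mean weight is 70 kg; Ha: it is not
t_stat, p_value = stats.ttest_1samp(sample['Weight'], 70)

# Reject Ho only if the p-value falls below the chosen alpha level
if p_value < 0.05:
    print('Reject Ho: the mean weight differs from 70 kg')
else:
    print('Fail to reject Ho')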

But your hypothesis test could still reach a wrong conclusion, which leads to the errors in hypothesis testing. There are two types of errors:

Type 1: If you conclude that a claim isn't true when it actually is true, that means rejecting Ho when you shouldn't have; this is called a Type 1 error, also described as a false alarm.

Type 2: If you conclude that a claim is true when it actually isn't, that means not rejecting Ho when you should have; this is called a Type 2 error, also described as a missed detection.

To reduce these errors we follow the best practices below:

Set a low cutoff probability (α) for rejecting Ho (minimizing Type 1 errors).

Select a large sample size to ensure that any differences that really exist won't be missed (minimizing Type 2 errors).

A confidence interval (CI) is used to estimate a population parameter by using sample statistics. The confidence interval determines an interval within which the population mean is likely to fall. For example, you might estimate the average household income (parameter) based on the average household income from a random sample of 1,000 homes (statistic). However, because sample results will vary, you need to add a measure of that variability to your estimate. This measure of variability is called the margin of error, the heart of a confidence interval. Your sample statistic, plus or minus your margin of error, gives you a range of likely values for the parameter — in other words, a confidence interval.

The t-distribution can be thought of as a cousin of the normal distribution — it looks similar to a normal distribution in that it has a basic bell shape with an area of 1 under it, but is shorter and flatter than a normal distribution. Like the standard normal (Z) distribution, it is centered at zero, but its standard deviation is proportionally larger compared to the Z-distribution. The t distribution is usually used to analyze the population when the sample is small.
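As a minimal sketch reusing the hypothetical sample DataFrame from earlier, we can compute a 95% confidence interval for the mean weight with the t-distribution; the interval is exactly the sample mean plus or minus the margin of error:

from scipy import stats

weights = sample['Weight']
mean = weights.mean()
sem = stats.sem(weights)  # standard error of the mean

# 95% confidence interval around the sample mean
low, high = stats.t.interval(0.95, df=len(weights) - 1, loc=mean, scale=sem)
print(low, high)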

Correlation measures the similarity between two random variables. We use the correlation coefficient to measure the strength and direction of the linear relationship between two numerical variables X and Y. The correlation coefficient for a sample of data is denoted by r. The most commonly used correlation is the Pearson correlation, which is the covariance of X and Y (the expected product of their deviations from their means) divided by the product of the standard deviations of X and Y. The correlation r is always between +1 and -1. Here is how you interpret various values of r. A correlation that is:

Exactly -1 indicates a perfect negative linear relationship.

Close to -1 indicates a strong negative linear relationship.

Close to 0 means no linear relationship exists.

Close to +1 indicates a strong positive linear relationship.

Exactly +1 indicates a perfect positive linear relationship.

The corr() function gives us the correlation coefficients:

sample.corr()
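SciPy's stats.pearsonr computes the same Pearson coefficient for a pair of columns and also returns a p-value for it; a minimal sketch with the hypothetical sample DataFrame:

from scipy import stats

# Pearson correlation between weight and height, with its p-value
r, p_value = stats.pearsonr(sample['Weight'], sample['Height'])
print(r, p_value)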

Please remember that correlation doesn’t necessarily mean cause-and-effect. A cause-and-effect relationship is one where a change in X causes a change in Y.

I will conclude this post here. In it, I have tried to explain the theory along with practical Python implementations of various inferential statistics concepts, putting most of the elements of inferential statistics in one place as a quick future reference guide.

Originally published at https://www.linkedin.com.
