What is a P-Value?

In data science interviews, one of the most frequently asked questions is ‘What is a p-value?’

According to the American Statistical Association,
“A p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.” That’s hard to grasp, yes?
Alright, let’s break down what a p-value really is into small, meaningful pieces to make it clear.
When and how is p-value used?
To understand p-value, you need to understand some background and context behind it. So, let’s start with the basics.
p-values are reported whenever you perform a statistical significance test (like a t-test, chi-square test, etc.). These tests typically return a computed test statistic and the associated p-value. This reported value is used to establish the statistical significance of the relationships being tested. So, whenever you see a p-value, there is an associated statistical test. That means there is a hypothesis test being conducted with a defined null hypothesis (H0) and a corresponding alternate hypothesis (HA).
The p-value reported is used to decide whether the null hypothesis being tested can be rejected or not. Let’s understand a little bit more about the null and alternate hypotheses.
Now, how to frame a Null hypothesis in general?
While the null hypothesis itself changes with every statistical test, there is a general principle to frame it:
The null hypothesis assumes there is ‘no effect’ or ‘no relationship’ by default.
For example: if you are testing whether a drug treatment is effective, the null hypothesis will assume there is no difference in outcome between the treated and untreated groups. Likewise, if you are testing whether one variable influences another (say, car weight influences mileage), the null hypothesis will postulate there is no relationship between the two. It simply implies the absence of an effect.
Here are some examples of the null hypothesis (H0) for popular statistical tests:
  • Welch Two Sample t-Test: The true difference in means of two samples is equal to 0
  • Linear Regression: The beta coefficient (slope) of the X variable is zero
  • Chi Square test: There is no difference between expected frequencies and observed frequencies.
Get the feel?
But what would the alternate hypothesis look like?
The alternate hypothesis (HA) is always framed to negate the null hypothesis. The corresponding HA for the above tests are as follows:
  • Welch Two Sample t-Test: The true difference in means of two samples is NOT equal to 0
  • Linear Regression: The beta coefficient (slope) of the X variable is NOT zero
  • Chi Square test: The difference between expected frequencies and observed frequencies is NOT zero.
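To make the idea of a test returning a statistic and a p-value concrete, here is a minimal Python sketch for the two-group case. It does not use the Welch t-test itself; instead it swaps in an exact permutation test, which needs only the standard library. The data values and the function name `permutation_p_value` are made up for illustration.

```python
from itertools import combinations
from statistics import mean

# Hypothetical outcome measurements for two groups (illustrative only).
treated = [5.1, 4.9, 5.6, 5.8, 6.0, 5.5, 5.3, 5.9]
control = [4.1, 4.5, 4.0, 4.8, 4.3, 4.6, 4.2, 4.4]

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test for a difference in means.

    H0: group labels are interchangeable (no difference between groups).
    Returns (observed difference in means, p-value).
    """
    pooled = group_a + group_b
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = mean(group_a) - mean(group_b)

    extreme = 0  # relabelings at least as extreme as what we observed
    count = 0    # all possible relabelings
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = sum_a / n_a - (total - sum_a) / n_b
        # Small tolerance guards against floating-point round-off.
        if abs(diff) >= abs(observed) - 1e-9:
            extreme += 1
        count += 1
    return observed, extreme / count

observed_diff, p_value = permutation_p_value(treated, control)
print(f"difference in means = {observed_diff:.3f}, p-value = {p_value:.6f}")
```

The statistic here is the difference in group means, and the p-value is the share of all possible relabelings of the data that give a difference at least as large in absolute value.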
Now, back to the discussion on p-value.
Along with every statistical test, you will get a corresponding p-value in the results output. What is this meant for?
It is used to determine if the data is statistically incompatible with the null hypothesis. Let me put it in another way.
The p-value basically helps to answer the question: ‘Does the data really represent the observed effect, or could it have arisen by chance?’
This leads us to a more mathematical definition of P-Value.
The p-value is the probability of seeing the effect (E) when the null hypothesis is true.
p-value = P(E | H0 is true)
If you think about it, we want this probability to be very low.
Having said that, it is important to remember that the p-value refers not only to what we observed but also to observations more extreme than what was observed. That is why the formal definition of the p-value contains the statement ‘would be equal to or more extreme than its observed value.’
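The ‘equal to or more extreme’ part can be seen in a small worked example. The sketch below assumes a hypothetical experiment (not real data): a coin is flipped 100 times and lands heads 60 times, and the null hypothesis is that the coin is fair. The helper name `binomial_tail` is an illustrative choice.

```python
from math import comb

def binomial_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the observed count or anything more extreme."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_flips = 100
observed_heads = 60

# One-sided: probability of 60 or more heads if the coin is fair.
p_one_sided = binomial_tail(n_flips, observed_heads)

# Two-sided: also count results equally extreme in the other direction
# (40 or fewer heads). Under a fair coin the two tails are symmetric.
p_two_sided = min(1.0, 2 * p_one_sided)

print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
```

Note that the sum runs from the observed count up to the maximum, not just over the single observed outcome. That is exactly the ‘or more extreme’ clause.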
Now that you know the p-value measures the probability of seeing the effect when the null hypothesis is true, it follows that a sufficiently low value is required to reject the null hypothesis.
Notice how I have used the term ‘Reject the Null Hypothesis’ instead of stating the ‘Alternate Hypothesis is True’.
That’s because, we have tested the effect against the null hypothesis only.
So, when the p-value is low enough, we reject the null hypothesis and conclude the observed effect holds.
But how low is ‘low enough’ for rejecting the null hypothesis?
This level of ‘low enough’ cutoff is called the alpha level, and you need to decide it before conducting a statistical test.
But how low is ‘low enough’
Let’s first understand what the alpha level is.
It is the cutoff probability for the p-value to establish statistical significance for a given hypothesis test.
For an observed effect to be considered statistically significant, the p-value of the test should be lower than the pre-decided alpha value.
Typically, for most statistical tests (but not always), alpha is set to 0.05.
In that case, the p-value has to be less than 0.05 to be considered statistically significant.
What happens if it is say, 0.051?
It is still considered not significant. We do NOT call it weakly statistically significant.
It is either black or white. There is no gray with respect to statistical significance.
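The black-or-white rule amounts to a single comparison. A minimal sketch, where the function name `is_significant` is just an illustrative choice:

```python
def is_significant(p_value, alpha=0.05):
    """Return True only when the p-value is strictly below the alpha level."""
    return p_value < alpha

for p in (0.049, 0.05, 0.051):
    print(p, is_significant(p))
```

With alpha fixed at 0.05 in advance, 0.049 passes while 0.05 and 0.051 do not, however close they are to the cutoff.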
Now, how to set the alpha level?
Well, the usual practice is to set it to 0.05.
But when the occurrence of the event is rare, you may want to set a very low alpha. The rarer it is, the lower the alpha.
For example, in CERN’s Large Hadron Collider experiment to detect the Higgs boson (a very rare event), the alpha level was set as low as 5 sigma, which means a p-value of less than about 3 * 10^-7 is required to reject the null hypothesis.
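The link between ‘5 sigma’ and a p-value of about 3 * 10^-7 can be checked with a short calculation. The sketch below assumes the threshold is the one-sided upper tail of a standard normal distribution; the function name `one_sided_p_from_sigma` is made up for illustration.

```python
from math import erfc, sqrt

def one_sided_p_from_sigma(n_sigma):
    """Upper-tail probability of a standard normal beyond n_sigma standard deviations."""
    return 0.5 * erfc(n_sigma / sqrt(2))

p_five_sigma = one_sided_p_from_sigma(5)
print(f"5 sigma -> p = {p_five_sigma:.2e}")  # about 2.87e-07
```

For comparison, under the same assumption the usual alpha of 0.05 corresponds to well under 2 sigma.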
Whereas for a more likely event, it can go up to 0.1.
Secondly, the more samples (number of observations) you have, the lower the alpha level should be. That is because even a small effect can be made to produce a lower p-value just by increasing the number of observations.
The opposite is also true: a large effect can be made to produce a high p-value by reducing the sample size.
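A quick calculation shows how strongly the p-value depends on sample size. This is a sketch under simplifying assumptions that are not from the text above: a two-sample z-test with a known standard deviation of 1 in both groups and equal group sizes. The function name `two_sample_z_p_value` and the effect sizes are illustrative.

```python
from math import erfc, sqrt

def two_sample_z_p_value(effect, n_per_group, sd=1.0):
    """Two-sided p-value for a difference in means, assuming a known sd."""
    standard_error = sd * sqrt(2 / n_per_group)
    z = abs(effect) / standard_error
    return erfc(z / sqrt(2))  # equals 2 * (1 - Phi(z))

# The same tiny effect (0.05 standard deviations) at growing sample sizes.
for n in (100, 10_000, 1_000_000):
    print(f"effect=0.05, n={n:>9,}: p = {two_sample_z_p_value(0.05, n):.3g}")

# A large effect (1 standard deviation) with only 3 observations per group.
print(f"effect=1.00, n=3: p = {two_sample_z_p_value(1.0, 3):.3g}")
```

The effect size never changes inside each comparison; only the number of observations does, and that alone moves the p-value across the 0.05 line in both directions.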
In case you don’t know how likely the event is to occur, it’s common practice to set alpha to 0.05. But, as a rule of thumb, never set the alpha greater than 0.1.
Having said that, alpha = 0.05 is mostly an arbitrary choice. Then why do most people still use 0.05?
That’s because that is what is taught in college courses and what has traditionally been used by the scientific community and publishers.
What P value is not
Given the uncertainty around the meaning of p-value, it is very common to misinterpret and use it incorrectly.
Some of the common misconceptions are as follows:
  1. P-Value is the probability of making a mistake. Wrong!
  2. P-Value measures the importance of a variable. Wrong!
  3. P-Value measures the strength of an effect. Wrong!
A smaller p-value does not signify that the variable is more important, or even that the effect is stronger.
Why?
Because, like I mentioned earlier, any effect, no matter how small, can be made to produce a smaller p-value simply by increasing the number of observations (sample size).
Likewise, a larger p-value does not imply a variable is not important.
For sound communication, it is necessary to report not just the p-value but also the sample size along with it. This is especially necessary if the experiments involve different sample sizes.
Secondly, making inferences and business decisions should not be based only on the p-value being lower than the alpha level.
Analysts should understand the business sense, understand the larger picture and bring out the reasoning before making an inference and not just rely on the p-value to make the inference for you.
Does this mean the p-value is not useful anymore?
Not really. It is a useful tool because it provides an objective standard for everyone to assess. It’s just that you need to use it the right way.
