*Originally posted on http://blogs.warwick.ac.uk/simongates/entry/whats_wrong_with/ on 14 November 2012*

This is my own personal list of the failings of null hypothesis significance testing. The subject has been covered in detail by numerous authors in the statistical, medical and psychological literature, among other places (references – yes, I’ll add them when I get a minute), yet hypothesis testing and the p<0.05 culture still persist. I will add more points and explanations to the list as time goes on. So, for now, and in no particular order:

1. Significance level is arbitrary. There is no particular reason for choosing p=0.05 as the threshold for “significance” except that Ronald Fisher mentioned it in his early writings on p-values. I wouldn’t want to contradict Sir Ronald on a statistical matter, as he was far cleverer than I am, but he wasn’t advocating p=0.05 as a universal threshold. But this is more or less what it has become. In fact p=0.05 doesn’t represent strong evidence against the null hypothesis.
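
To put a rough number on that last claim, one common calibration (due to Sellke, Bayarri and Berger, not something in the post above) bounds the Bayes factor for the null against any well-behaved alternative by −e·p·ln(p), for p < 1/e. A quick sketch in Python:

```python
import math

def min_bayes_factor(p):
    """Sellke-Bayarri-Berger lower bound on the Bayes factor
    BF(H0 vs H1) >= -e * p * ln(p), valid for p < 1/e."""
    assert 0 < p < 1 / math.e
    return -math.e * p * math.log(p)

for p in (0.05, 0.01, 0.001):
    bf = min_bayes_factor(p)
    print(f"p = {p}: BF(H0 vs H1) >= {bf:.3f} "
          f"(odds against H0 at most {1 / bf:.1f} : 1)")
```

On this calibration, p=0.05 corresponds to odds against the null of at most about 2.5 to 1 – modest evidence by almost any standard.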

2. Artificial dichotomy. The division of results into “significant” and “non-significant” encourages erroneous dichotomous thinking: the belief that a “significant” result is real and important, whereas a non-significant one means there is no effect. Neither of these is correct. However, this dichotomy is precisely what the Neyman-Pearson hypothesis testing procedure requires.

3. P-values usually misinterpreted. There is empirical evidence that many or most researchers do not understand what p-values actually mean. Several published surveys show that many people are unable to distinguish true from false statements about p-values. This is more than a bit worrying, given how widely p-values are used to draw conclusions from research.

4. P-values do not tell us what we want to know. The p-value is the probability of getting the observed data (or more extreme data) if there is really zero difference, i.e. prob(data|no difference). This is not usually something we are very interested in knowing. Much more relevant is the probability that there is really no difference, given the data that have been observed – prob(no difference|data) – i.e. given the results obtained, how likely is it that there is really no difference? More relevant still are questions like: given the results observed, how likely is it that there is a clinically important difference? Or: how big is the difference, and what is the uncertainty around it?
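
The gap between these two conditional probabilities is easy to make concrete with Bayes’ theorem. A toy calculation (the prior and power figures here are invented purely for illustration):

```python
def prob_null_given_significant(prior_real, power, alpha=0.05):
    """P(no real difference | 'significant' result), by Bayes' theorem.
    prior_real: prior probability that a real difference exists.
    power: probability of a significant result when the difference is real.
    alpha: probability of a significant result when there is no difference."""
    p_sig = prior_real * power + (1 - prior_real) * alpha
    return (1 - prior_real) * alpha / p_sig

# If only 10% of the hypotheses we test are real, and power is 80%:
print(prob_null_given_significant(0.10, 0.80))  # 0.36
```

So in this (made-up but not implausible) scenario, 36% of “significant” results come from true nulls – nothing like the 5% that prob(data|no difference) might naively suggest.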

5. “Statistical significance” is not the same as clinical or scientific significance. It is quite possible (and common) to get a clinically important result that is not statistically significant, and equally possible (though less common) to get a clinically unimportant result that is statistically significant. This is because p-values depend on the sample size as well as the size of the difference. With a big enough sample size, any non-zero difference can be made statistically significant.
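
The sample-size dependence is easy to see by computing the p-value for a fixed, clinically trivial difference at two different sample sizes. A sketch using a two-sample z-test with known unit variance (an idealisation, chosen to keep the arithmetic self-contained):

```python
import math

def two_sample_z_p(diff, n):
    """Two-sided p-value for a two-sample z-test: each group has n
    observations with known unit variance and mean difference diff."""
    z = diff / math.sqrt(2.0 / n)
    return math.erfc(abs(z) / math.sqrt(2.0))

# The same trivially small difference (0.01 standard deviations):
print(two_sample_z_p(0.01, 100))        # ~0.94: "non-significant"
print(two_sample_z_p(0.01, 1_000_000))  # ~1.5e-12: "highly significant"
```

Identical (and clinically irrelevant) difference; wildly different p-values.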

6. Calculating a p-value under the assumption that the null hypothesis is true makes little sense because the null hypothesis is almost always false. Two different treatments, or a treatment and placebo or no treatment, will extremely rarely have exactly the same effect on an outcome. The exceptions may be treatments that are known to do absolutely nothing, like homeopathy or reflexology, but even here exactly the same effect would only be expected if the trials could be properly blinded to eliminate the effects of attention from the therapist.

7. P-values are uninformative. Even if you still think that a significance test is a reasonable thing to do, it tells you very little that is useful. All it tests is whether a difference is non-zero; it gives no information about the size of the difference or the uncertainty around it, nor even the strength of evidence against the null hypothesis.

8. Instability of p-values on replication. It is a little-appreciated fact that if an experiment is repeated, a quite different p-value can result; any obtained p-value is a poor predictor of the p-value in a replication. This matters less for extremely small p-values, but more as they approach the significance threshold.

9. There are several misconceptions and misinterpretations that are frequently made. One of the commonest is that a p-value of less than 0.05 means that the difference is unlikely to be due to chance. But the p-value is calculated on the ASSUMPTION that there is no difference so it obviously cannot say anything about whether or not this assumption is true. For that you need to know how likely the null hypothesis is.

10. Another is that the p-value is the probability that the null hypothesis is true (and hence that 1−p is the probability that the alternative hypothesis is true). Neither of these probabilities has anything to do with the p-value.

11. Yet another common logical fallacy is that if the null hypothesis is not rejected, it is accepted as true. This is the same error as assuming that treatments are the same if they are not found to be “significantly” different – but of course “absence of evidence is not evidence of absence”. Clearly a comparison can give a nonsignificant result for reasons other than the null hypothesis being true.

12. The p-value depends not only on the data, but also on the intention of the experiment. Hence the same set of data can give rise to widely differing p-values, depending on what the intention was when the data were collected (how many subjects were to be included, how many comparisons were to be made, etc). This makes very little intuitive sense. Some good examples are given by Goodman (1999) and Kruschke (2010). The familiar adjustment of p-values for multiple comparisons is one manifestation of this phenomenon.
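
One manifestation of this intention-dependence is optional stopping: if you test repeatedly as the data accumulate and stop as soon as p < 0.05, the false-positive rate is inflated well above the nominal 5% even when the null hypothesis is exactly true. A simulation sketch (the maximum sample size and the number of interim looks are arbitrary choices):

```python
import math
import random

random.seed(2)

def peeking_trial(max_n=200, look_every=10, alpha=0.05):
    """Simulate data with NO real effect (one sample, mean 0, unit
    variance), testing after every batch and stopping as soon as
    p < alpha -- i.e. optional stopping."""
    xs, n = [], 0
    while n < max_n:
        xs.extend(random.gauss(0.0, 1.0) for _ in range(look_every))
        n += look_every
        z = sum(xs) / math.sqrt(n)  # one-sample z-test of mean = 0
        if math.erfc(abs(z) / math.sqrt(2.0)) < alpha:
            return True  # declared "significant"
    return False

false_pos = sum(peeking_trial() for _ in range(2000)) / 2000
print(f"False-positive rate with peeking: {false_pos:.1%} (nominal 5%)")
```

The data-generating process is identical in every trial; only the experimenter’s stopping intention differs from a fixed-n design, yet the error rate roughly quadruples.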

Goodman SN. Toward evidence-based medical statistics. 1: The P value fallacy. Annals of Internal Medicine 1999; 130: 995–1004.

Kruschke JK. Bayesian data analysis. WIREs Cognitive Science 2010; 1(5): 658–676. DOI: 10.1002/wcs.72