I recently came across a helpful account of an issue that has been bothering me: the interpretation of significance tests. It was in a slightly unexpected place, the GraphPad software online statistics guide:

http://www.graphpad.com/guides/prism/6/statistics/index.htm?stat_more_misunderstandings_of_p_va.htm

The issue is how you interpret a “significant” p-value. Say you compare a drug to placebo to see if it cures people, and you get a “significant” effect (p < 0.05). Does that mean the drug works? Not necessarily. Apart from the obvious 5% of occasions when you will get a “significant” effect even though the drug does nothing, it also depends on the prior probability that the drug is effective. It’s exactly the same issue as with diagnostic tests, where the prevalence of a disease has a huge effect on the positive predictive value of a test. If a disease is very rare, even a test with extremely high sensitivity and specificity can be essentially useless, because almost all of the positives will be false positives.

So it is with trials. If your trial has 80% power and a 5% Type I error rate, and the prior probability of the drug being effective is 80%, then in 1000 replicates of the experiment you would expect to get:

Prior probability = 80%

| | Drug really works | Drug really doesn’t work | Total |
|---|---|---|---|
| P<0.05, “significant” | 640 | 10 | 650 |
| P>0.05, “not significant” | 160 | 190 | 350 |
| Total | 800 | 200 | 1000 |

So on 640/650 (98.46%) of the occasions where you get a “significant” result, the drug will really be effective. [It would also be effective in nearly half of the experiments with a “non-significant” result (160/350, or about 46%).]
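The 640/650 figure is the positive predictive value from the diagnostic-test analogy, and it can be written as a short formula. The symbols here are my own labels rather than anything from the GraphPad guide: $\pi$ for the prior probability that the drug works, $1-\beta$ for the power, and $\alpha$ for the Type I error rate.

$$
\Pr(\text{drug works} \mid \text{significant})
= \frac{(1-\beta)\,\pi}{(1-\beta)\,\pi + \alpha\,(1-\pi)}
= \frac{0.8 \times 0.8}{0.8 \times 0.8 + 0.05 \times 0.2}
= \frac{0.64}{0.65} \approx 0.9846
$$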

However, if there is only a 10% chance that the drug really works, things look a lot worse.

Prior probability = 10%

| | Drug really works | Drug really doesn’t work | Total |
|---|---|---|---|
| P<0.05, “significant” | 80 | 45 | 125 |
| P>0.05, “not significant” | 20 | 855 | 875 |
| Total | 100 | 900 | 1000 |

Now the drug is only really effective in 64% of trials with a “significant” result. With 1% prior probability of the drug’s effectiveness, it really works in only 14% of trials with “significant” results.
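The arithmetic behind these tables is easy to reproduce. The following is a minimal Python sketch, not anything taken from the GraphPad guide; the function name `expected_outcomes` and its parameter names are made up for illustration, and it assumes the same 80% power and 5% Type I error rate used above.

```python
def expected_outcomes(prior, power=0.80, alpha=0.05, n_trials=1000):
    """Expected 2x2 table of trial outcomes for a given prior probability.

    prior    -- assumed probability that the drug really works
    power    -- probability of a "significant" result when it does work
    alpha    -- probability of a "significant" result when it does not
    n_trials -- number of hypothetical replicates of the experiment
    """
    works = prior * n_trials            # trials of a drug that really works
    no_effect = n_trials - works        # trials of a drug that does nothing

    true_pos = power * works            # works, "significant"
    false_neg = works - true_pos        # works, "not significant"
    false_pos = alpha * no_effect       # does nothing, "significant"
    true_neg = no_effect - false_pos    # does nothing, "not significant"

    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "false_neg": false_neg,
        "true_neg": true_neg,
        # Share of "significant" results where the drug really works
        "ppv": true_pos / (true_pos + false_pos),
    }


if __name__ == "__main__":
    for prior in (0.80, 0.10, 0.01):
        table = expected_outcomes(prior)
        significant = table["true_pos"] + table["false_pos"]
        print(
            f"prior={prior:.0%}: {table['true_pos']:.1f} of "
            f"{significant:.1f} significant results are real "
            f"(PPV={table['ppv']:.1%})"
        )
```

Run with priors of 80%, 10% and 1%, this gives the 98.46%, 64% and roughly 14% figures quoted above.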

So the prior probability of the treatment’s effectiveness is absolutely crucial in interpreting the results of trials. But I don’t think I have ever seen this mentioned in the results or discussion of a paper. I’m really not sure how you would go about downgrading your confidence in a frequentist result based on the prior probability; there isn’t a mechanism for doing this. But this is undoubtedly a major cause of misinterpretation of trial results. When you consider that most trials have pretty low power (maybe 50-60% at best) to detect realistic treatment effects, and that the majority of interventions that are tested probably don’t work (maybe at best 20% are effective?), then the proportion of “significant” results that are false positives is going to be substantial.
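To put a rough number on "substantial", here is a small Python sketch that plugs in the guesses from the paragraph above: 50-60% power and 20% of interventions being effective. Those inputs are only the ballpark figures quoted in the text, the 5% Type I error rate is carried over from the earlier tables, and the function name `false_positive_share` is hypothetical.

```python
def false_positive_share(prior, power, alpha=0.05):
    """Share of "significant" results that come from ineffective treatments."""
    true_pos = power * prior          # effective and "significant"
    false_pos = alpha * (1 - prior)   # ineffective but "significant"
    return false_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Ballpark inputs from the text, not estimates from data
    for power in (0.5, 0.6):
        share = false_positive_share(prior=0.2, power=power)
        print(f"power={power:.0%}, prior=20%: "
              f"{share:.0%} of significant results are false positives")
```

Under these assumptions, something like a quarter to nearly 30% of "significant" results would be false positives, far above the nominal 5%.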

This is another way in which Bayesian methods score over traditional analyses: they force us to consider the prior probabilities of hypotheses, and to include them explicitly in the analysis. The issue always seems to be swept under the carpet in traditional analyses, with potentially disastrous consequences. Actually, saying it is swept under the carpet is probably inaccurate: most people are completely unaware that this is even an issue.

*Originally posted at http://blogs.warwick.ac.uk/simongates/entry/significance_testing_and/ on 3 December 2013.*