In another of our monthly blog posts, written by Science Wordsmith Lucy Goodchild, we explore the findings and attention around a piece of research published in the previous month that caught the public’s attention. Listen to the accompanying podcast here.
A Nature commentary with the support of 850 signatories has generated the highest Altmetric attention score of all time.
What if our collective scientific knowledge was structured around an arbitrary threshold, a change to which could alter our assessment of scientific findings, shifting the basis on which policy is built and, ultimately, making us question what we currently believe to be true?
This is closer to reality than you may think: that arbitrary threshold is a statistical line – a P value of 0.05 – which is used in science to determine whether results are statistically significant. Despite its widespread use, there is growing concern that the concept of statistical significance based on the 0.05 (or any other) threshold is actually causing many problems: overconfidence in single studies, false impressions of results being in conflict and a tarnished literature.
In a commentary in Nature, Prof Valentin Amrhein of the University of Basel, Prof Sander Greenland of the University of California, Los Angeles, and Prof Blake McShane of Northwestern University’s Kellogg School of Management lead a group of 850 signatories calling for statistical significance based on this arbitrary threshold to be retired. And the support goes much further: with extensive global coverage, the article has achieved the highest Altmetric attention score ever recorded, at over 12,800.
Read the paper: “Scientists rise up against statistical significance”
“The way it’s perceived is if you get statistical significance, your results are reliable and your results are true,” said Prof McShane. “The idea is it’s a safeguard against noise-chasing, but in reality, it fails to deliver on that promise. And a sharper problem is that it’s an arbitrary dichotomization of a continuous measure of evidence.”
What is statistical significance?
To understand the problems with statistical significance, it’s important to understand the concept… but that’s easier said than done. “It’s a complex subject,” said Prof McShane. “In fact, one of the big difficulties here is that what a P value quantifies is highly counterintuitive. It’s very easy to get it wrong and so mistakes are common.”
The P value measures the probability of finding the observed results (or more extreme results) if the null hypothesis is true. It’s a number between 0 and 1; if the P value is below 0.05, the result is traditionally labeled statistically significant, and if it’s above 0.05 it’s not statistically significant.
One of the problems with this is that 0.05 – or indeed any threshold – is arbitrary. The P value is a continuous measure of evidence, and as Prof McShane and his colleagues explain in their paper, there is no ontological basis for drawing such a sharp line through it.
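The continuity of the P value can be made concrete with a small simulation. The sketch below (illustrative only, with made-up measurements; it is not an analysis from the paper) runs a simple permutation test: it asks how often random relabeling of two groups produces a difference in means at least as extreme as the one observed. That fraction is a P value, and it comes out as a number on a continuum, not as a "significant/not significant" verdict.

```python
import random
import statistics

random.seed(0)

# Hypothetical measurements for two groups (made-up numbers for illustration)
group_a = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.4, 5.1]
group_b = [5.3, 5.5, 5.2, 5.6, 5.1, 5.4, 5.7, 5.3]

observed = statistics.mean(group_b) - statistics.mean(group_a)

# Permutation test: how often does random relabeling produce a difference
# at least as extreme as the observed one? That fraction is the P value.
pooled = group_a + group_b
n_extreme = 0
n_perm = 10_000
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[8:]) - statistics.mean(pooled[:8])
    if abs(diff) >= abs(observed):
        n_extreme += 1

p_value = n_extreme / n_perm
print(f"observed difference: {observed:.2f}, P = {p_value:.3f}")
# Whatever number comes out, it sits somewhere on the 0-to-1 continuum;
# calling it "significant" below 0.05 discards that graded information.
```

A P value of 0.049 and one of 0.051 describe nearly identical evidence, yet the threshold sorts them into opposite categories.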
Dichotomania, bias and overconfidence
This threshold and what it causes – namely the dichotomization of results into statistically significant and not statistically significant – is what the authors say needs to be retired. This so-called dichotomania has become so pervasive in science – and so influential in law, policy and medicine – that it is shaping what we believe to be true, often leading to false conclusions.
“If you do not get a statistically significant result, that does not prove the null hypothesis – that there’s no difference between groups, or that some treatment doesn’t work,” said Prof McShane. “This is a fallacy, but it’s so common in scientific talks to see somebody compare two groups and say there was no difference because the difference did not attain statistical significance.”
The tendrils of dichotomania have led to many problems beyond this: misconceptions about statistical significance have contributed to the replicability crisis in the biomedical and social sciences. It has led people to believe two studies are in conflict when they are not. It has led to overconfidence in the results of single studies. And through publication bias, it has resulted in a literature that is, by definition, not representative.
The upshot of this is that we often underestimate the magnitude of the natural variation in statistical results between honestly reported studies. “Scientists and the public expect statistically significant results to be repeatable and are thus (wrongly) disappointed when this is not always the case,” said Prof Amrhein. “This is not a problem of the statistical methods that we use, but of the way we interpret and communicate their output.”
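This natural variation is easy to demonstrate. The sketch below (a simulation I constructed for illustration; the effect size, sample size, and approximate z-test are all assumptions, not figures from the commentary) runs the *same* experiment a thousand times with a genuine underlying effect, and shows that honest replications scatter widely across the 0.05 line.

```python
import math
import random
import statistics

random.seed(1)

def p_value_two_sample(a, b):
    """Approximate two-sided P value for a difference in means (normal z-test)."""
    diff = statistics.mean(a) - statistics.mean(b)
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = diff / se
    # Two-sided normal tail probability: P(|Z| > z) = erfc(z / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

# Simulate the SAME experiment 1,000 times: a true effect of 0.5 SD,
# 20 observations per group. Every run is an honest replication.
p_values = []
for _ in range(1000):
    a = [random.gauss(0.5, 1.0) for _ in range(20)]
    b = [random.gauss(0.0, 1.0) for _ in range(20)]
    p_values.append(p_value_two_sample(a, b))

significant = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"fraction of replications labeled 'significant': {significant:.2f}")
```

Even though a real effect is present in every run, only a fraction of the replications cross the threshold, so a "failed" replication of a significant result is often just ordinary sampling variation, not a contradiction.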
While statistical significance sends the so-called significant results into the literature, the results on the other side of the threshold often disappear into the “famous file drawer,” according to Prof Amrhein, never to see the light of day.
The end for statistical significance
These aren’t recent revelations; the problems with statistical significance have been known for many decades. The paper’s first reference is a 1935 article in Nature by R. A. Fisher, the founder of modern statistics, highlighting that it’s a fallacy to conclude that there’s no difference just because a result is not statistically significant. And some 16 years earlier, Edwin Boring discussed similar problems in his paper “Mathematical vs. scientific significance.”
Almost a century later, in 2016, the American Statistical Association released a statement on statistical significance and P values, which was followed by a conference on statistical inference in 2017. Now a special issue of the journal The American Statistician brings together 43 papers on the topic. The Nature commentary was released in parallel.
“We’re not advocating for an anything-goes situation in which weak evidence suddenly becomes strong evidence,” said Prof McShane. Instead, they’re saying scientists need to embrace uncertainty rather than try to remove it.
“Statistics—that is, the field of statistics—isn’t about removing uncertainty; what it really is about is quantifying uncertainty,” he added. “If we could just appreciate that more and embrace uncertainty and live with it, I think we’d see people making less strong claims one way or another for positive or negative findings and instead having a more nuanced view of research findings.”
It won’t be easy to break the “worrisome circularity” that exists around statistical significance – it’s taught because it’s used and it’s used because it’s taught. But there’s a lot of support for retiring the concept, and there are plenty of alternatives being proposed. The authors believe the best methods will be specific to different fields rather than being generic.
“Overall, we should accept methodologic diversity – there is no magic alternative to significance testing, and no single statistical method suits all purposes,” said Prof Amrhein. “If we would learn to place less faith in results with small P values and not to dismiss results with larger P values as being necessarily zero, then we think we would see more honest and full reporting, less significance chasing, less publication bias, and less inflated effect sizes in the published literature.”
Spreading the word
The message resonated widely: 30 news outlets, 35 blogs and more than 18,000 tweets contributed to an Altmetric score of over 12,800 – the highest ever. While the team sent a few messages to contacts and on social media, it was the attention of prominent bloggers that really helped to get the message out.
Discover the coverage: https://www.altmetric.com/details/57358237?src=bookmarklet
“We have been absolutely overwhelmed by the response,” said Prof Amrhein. “We certainly expected some discussion, so it is very good that there is now an intense debate about how we want to draw and formulate our scientific conclusions.”
Their hope is that the discussion leads to lasting change and an end to the P value threshold and statistical significance.
“On one hand, history would show that this is a losing battle,” said Prof McShane. “On the other hand, it does seem like maybe something’s different this time. Maybe there’s a chance to finally put the nail in the coffin for good on this one, or at least make people aware of and think about alternative approaches to the analysis and interpretation of their data.”
Prof McShane’s view on making a splash
“Pretty much everything we say in our article has been said for decades, if not a century. Obviously the message still needs to be heard because these errors are still very widespread. But we had no idea that it would generate this level of engagement and buzz, and that’s certainly not what we set out for. I think in general, a researcher’s focus should be on looking for the truth and publishing replicable findings. In some cases I think searching around for hot topics or to make a splash may actually be antithetical to the idea of finding the truth and publishing replicable findings. Single studies and single papers don’t prove anything definitively and we need to triangulate with many studies, many papers and different approaches before we make any firm conclusions.”