If you’re an academic desperate to publish or a consulting firm trying to find a politically convenient result, you sometimes run into the unpleasant reality that the world, or at least the data you’ve gathered, doesn’t support your hypothesis. However, there are plenty of tricks you can play with data to get statistically significant results even when the actual evidence doesn’t support your claim. For instance, you can test dozens of hypotheses and correlations until one, by chance, looks statistically significant, in what is known as p-hacking or p-value hacking. A p-value is statistical shorthand for how likely a result is to occur by chance alone, which means that low p-values suggest, but don’t prove, that a relationship between two variables is “real” rather than chance.
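To see how easy this is, here is a toy simulation in Python (made-up data, purely for illustration): run twenty comparisons on pure noise, report only the best one, and you will “find” something roughly two-thirds of the time.

```python
# Toy p-hacking demo: run 20 unrelated comparisons on pure noise and see how
# often at least one of them comes back "statistically significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
trials, lucky_runs = 1_000, 0

for _ in range(trials):
    # 20 comparisons where the truth is "no effect at all"
    pvals = [
        stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
        for _ in range(20)
    ]
    if min(pvals) < 0.05:      # report only the "best" result
        lucky_runs += 1

print(f"At least one 'significant' finding in {lucky_runs / trials:.0%} of runs")
# With 20 independent null tests, that happens about 1 - 0.95**20 ≈ 64% of the time.
```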
When researchers want to find a result, they might turn to p-hacking — slicing and dicing data until something looks statistically significant. But you can also do the reverse: design an experiment so underpowered it can’t detect anything real.
No Power Analysis at the Wuhan Institute of Virology
After the COVID-19 pandemic, as suspicion grew that EcoHealth Alliance, an environmental nonprofit, and its scientific partner, the Wuhan Institute of Virology, may have inadvertently caused the pandemic, the U.S. Congress began investigating and interviewing key figures from EcoHealth, including Peter Daszak, then head of EcoHealth Alliance. He was asked why no one at either WIV or EcoHealth had flagged a concerning result: an engineered virus they made killed 6 of 8 mice, compared to 2 of 7 mice for the baseline strain. Daszak said there had been no need to worry because, as he insisted, the experiment had only been “[w]ith one group, a very small number of mice. That is not statistically significant.”
Either Daszak misunderstood the statistics, or he hoped others would. (A basic two-proportion z-test of the hypothesis that the engineered virus is deadlier yields a p-value of about 0.04.)
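For readers who want to check that arithmetic, here is a minimal sketch in Python, assuming the pooled one-sided z-test described above; the exact value shifts a bit depending on which test you choose.

```python
# Two-proportion z-test: did the engineered virus kill a larger share of mice
# than the baseline virus? (6 of 8 vs. 2 of 7, one-sided alternative.)
from statsmodels.stats.proportion import proportions_ztest

deaths = [6, 2]   # engineered virus, baseline virus
mice   = [8, 7]   # group sizes reported for the two viruses

z, p = proportions_ztest(count=deaths, nobs=mice, alternative='larger')
print(f"z = {z:.2f}, one-sided p = {p:.3f}")   # roughly z ≈ 1.8, p ≈ 0.04
```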
But he’s right that the experiment used a very small number of mice. As with much of the activity in Wuhan, the reporting from EcoHealth and WIV is either mysterious or sloppy; for example, when discussing the different viruses tested on mice, they only provide the sample sizes for the baseline virus and the deadliest engineered virus, leaving readers to assume that similar sample sizes were used for the other engineered viruses.
The smaller the sample size, the harder it is to tell whether one virus is more deadly than another, for the same reason that if you flip a coin three times you won’t know whether it’s biased, but if you flip it a hundred times you can make a much better guess. Scientists, or freshmen taking research design courses, are well aware of this and can run a power analysis to choose a sample size that gives them a good chance of detecting real differences between two groups.
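To make that concrete, here is a minimal power-analysis sketch for the coin example, under assumptions chosen purely for illustration (a coin that lands heads 60% of the time, an 80% chance of detection, a one-sided test at the usual 5% threshold).

```python
# Toy power analysis: how many flips to have an 80% chance of detecting a coin
# that actually lands heads 60% of the time, against the null of a fair coin?
import numpy as np
from scipy.stats import norm

p_null, p_true = 0.5, 0.6
z_alpha, z_beta = norm.ppf(0.95), norm.ppf(0.80)   # one-sided alpha = 0.05, power = 0.80

# Standard sample-size formula for a one-sample test of a proportion
n = ((z_alpha * np.sqrt(p_null * (1 - p_null))
      + z_beta * np.sqrt(p_true * (1 - p_true))) / (p_true - p_null)) ** 2
print(f"about {int(np.ceil(n))} flips needed")   # on the order of 150
```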
But with the 15 mice that EcoHealth and WIV used to test the difference in mortality between one of their engineered viruses and the baseline virus, we can only see that the engineered virus is deadlier because the difference is so dramatic. Had the engineered virus been slightly less deadly, Daszak could have honestly claimed there was no statistically significant difference between the viruses in terms of lethality in mice.
For example, if the engineered virus had a mortality rate of 50%, still much higher than the baseline virus’s, you would need a sample size of 168 mice to have even an 80% chance of finding a statistically significant difference between the two viruses. Of course, we’ll never know why those researchers chose to use such a small sample size, but it’s impossible to implement sensible safeguards against high-risk research if scientists refuse to measure how dangerous their pathogens are.
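To see just how blind the actual experiment was to a difference of that size, here is a minimal sketch using a normal-approximation power calculation in Python’s statsmodels (the exact figure depends on which test you assume): the chance that an 8-versus-7 mouse comparison would detect a true jump from roughly 29% to 50% mortality.

```python
# Power of the actual 8-vs-7 mouse experiment to detect a hypothetical rise in
# mortality from ~29% (2 of 7 baseline) to 50%, via a normal approximation.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect = proportion_effectsize(0.50, 2 / 7)   # Cohen's h for 50% vs. ~29%
power = NormalIndPower().power(
    effect_size=effect,
    nobs1=8,            # engineered-virus group
    ratio=7 / 8,        # baseline group of 7 mice
    alpha=0.05,
    alternative='larger',
)
print(f"power ≈ {power:.0%}")   # roughly 20%: the difference would usually be missed
```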
But the Wuhan Institute of Virology is far from the only place conducting gain-of-function research with sample sizes so small that it’s hard to see whether the engineered viruses are more deadly than their natural predecessors. In 2018, a research team at the University of North Carolina that included future WIV collaborators created mutant coronaviruses and used groups of mice as small as five.
Please just let us know if these viruses are deadly
Some virus hunters seek out new natural viruses, either for pure scientific exploration or in hopes of preparing for emerging pathogens. This of course has risks: they could accidentally unleash a pandemic. And unfortunately, some scientists insist on using tiny animal sample sizes for natural viruses as well. In 2018, a novel bat-borne coronavirus was discovered in China, and scientists tested the lethality of the virus in pigs using a control group of five animals and a treatment group of seven animals.
This means that virus hunters are exposing themselves and other lab workers to new pathogens without even taking basic steps to get good data out of their investigations. If this type of work can be justified at all, there is no excuse for doing it with insufficient sample sizes. Sometimes viruses are so dangerous that we can see it even in tiny samples. But if we consistently use small sample sizes, we risk erroneously treating dangerous viruses as benign.
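For a sense of what a 7-versus-5 pig experiment can and cannot show, here is a minimal sketch using Fisher’s exact test on two hypothetical outcomes (not the study’s actual results): only near-total lethality registers as statistically significant.

```python
# Hypothetical outcomes for a 7-treated vs. 5-control pig experiment, tested
# with a one-sided Fisher's exact test. Tables are [[deaths, survivors], ...].
from scipy.stats import fisher_exact

# Extreme outcome: every treated pig dies, every control survives
_, p_extreme = fisher_exact([[7, 0], [0, 5]], alternative='greater')

# Still-alarming outcome: 4 of 7 treated die (57%) vs. 1 of 5 controls (20%)
_, p_moderate = fisher_exact([[4, 3], [1, 4]], alternative='greater')

print(f"all die vs. none: p ≈ {p_extreme:.4f}")   # ≈ 0.001, clearly significant
print(f"57% vs. 20% dead: p ≈ {p_moderate:.2f}")  # ≈ 0.25, looks like 'no effect'
```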
Requiring adequate sample sizes will raise the costs of virological research, since more animals mean more costs, and that is fine. Science policy should not be set up to study dangerous pathogens on the cheap.
Since so much of this research is federally funded, the Trump administration has tools to step in. The administration recently put out an executive order calling for federal agencies to embrace “gold standard science” and demand scientific study that is “structured for falsifiability of hypotheses,” reproducible, and transparent. Tiny animal studies of viruses, natural or engineered, fail to live up to that. Any effort to measure the mortality rate of such a virus would have enormous error bars, and it becomes very hard to tell whether an engineered virus is more dangerous than the virus it was based on.
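To put a number on those error bars, here is a minimal sketch using the 6-of-8 mortality figure from the mouse experiment above and a standard Wilson confidence interval (other interval methods give similarly wide ranges).

```python
# 95% confidence interval for a mortality rate estimated from 6 deaths out of
# 8 animals, using the Wilson score interval.
from statsmodels.stats.proportion import proportion_confint

low, high = proportion_confint(count=6, nobs=8, alpha=0.05, method='wilson')
print(f"observed 75% mortality, 95% CI roughly {low:.0%} to {high:.0%}")
# Prints something like 41% to 93%: consistent with anything from "moderately
# dangerous" to "almost always lethal".
```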
The administration can also rescind funding that was already granted, as I recently argued in Tablet. Federal animal welfare policy allows the NIH Director to pull funding from research projects that fail to use the minimum number of animals needed to achieve scientifically valid results. NIH should do so. We shouldn’t be killing lab animals in studies that can’t even tell us what we want to know.
It’s regrettable that so many scientists are willing to pursue underpowered studies, especially when these studies could produce viruses whose dangers will be hard to detect. If virology had better norms, perhaps there would be no need to call for political solutions. But so far virologists have failed to impress with their behavior. Individual scientists and academic journals should also have the integrity not to propose such flawed studies, not to participate in them, and not to publish them. If you want to be a scientist, be a responsible one. Until that happens, we can only hope that politicians will be more responsible.
In semi-regulated, self-licking ice cream cone research, a lack of statistical power is a feature, not a bug.
A lack of power is not enough to make publication impossible, and too much power means actually reporting the danger of what you're doing to the authorities.