On literature pollution and cottage-industry science
A few days ago there was a minor Twitterstorm over a particular paper that claimed to have found an imaging biomarker that was predictive of some aspect of outcome in adults with autism. The details actually don’t matter that much and I don’t intend to pick on that study in particular, or even link to it, as it’s no worse than many that get published. What it prompted, though, was more interesting – a debate on research practices in the field of cognitive neuroscience and neuroimaging, particularly relating to the size of studies required to address some research questions and the scale of research operation they might entail.
What kicked off the debate was a question of how likely the result they found was to be “real”; i.e., to represent a robust finding that would replicate across future studies and generalise to other samples of autistic patients. I made a fairly uncompromising prediction that it would not replicate, which was based on the fact that the finding derived from: a small sample (n=31, in this case, but split into two), an exploratory study (i.e., not aimed at or constrained by any specific hypothesis, so that group differences in pretty much any imaging parameter would do) and lack of a replication sample (to test directly, with exactly the same methodology, whether the findings from the study were robust, prior to bothering anyone else with them).
My cynicism has two sources. First, the study was statistically under-powered, and when power is low a larger share of the “significant” results that get reported turn out to be false positives. Second, and more damningly, there have been literally hundreds of similar studies published using neuroimaging measures to try to identify signatures that would distinguish between groups of people or predict the outcome of illness. For psychiatric conditions like autism or schizophrenia I don’t know of any such “findings” that have held up. We still have no diagnostic or prognostic imaging markers, or any other biomarkers for that matter, that have either yielded robust insights into underlying pathogenic mechanisms or been applicable in the clinic.
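The link between low power and false positives can be made concrete with the positive predictive value: the fraction of significant results that reflect a true effect. The sketch below is a back-of-the-envelope calculation, not an analysis of any real study. The function name is made up for illustration, and the prior probability of a true effect and the two power levels are assumptions picked only to show the direction of the effect.

```python
def positive_predictive_value(power, alpha, prior):
    """Fraction of significant results that reflect a true effect.

    power: probability of detecting an effect that really exists
    alpha: significance threshold (false positive rate under the null)
    prior: assumed fraction of tested hypotheses that are actually true
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


# Illustrative inputs only: a 10% prior and two power levels (assumptions).
for power in (0.8, 0.2):
    ppv = positive_predictive_value(power, alpha=0.05, prior=0.1)
    print(f"power={power:.1f}  PPV={ppv:.2f}")
```

The significance threshold is the same in both cases; what changes is how many true effects are detected to set against the same number of false alarms.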
There is thus strong empirical evidence that the small-sample, exploratory, no-replication design is a sure-fire way of generating findings that are, essentially, noise.
This is by no means a problem only for neuroimaging studies; the field of psychology is grappling with similar problems and many key findings in cell biology have similarly failed to replicate. We have seen it before in genetics, too, during the “candidate gene era”, when individual research groups could carry out a small-scale study testing single-nucleotide polymorphisms (SNPs) in a particular gene for association with a particular trait or disorder. The problem was that the samples were typically small and under-powered, the researchers often tested multiple SNPs, haplotypes or genotypes but rarely corrected for such multiple tests, and they usually did not include a replication sample. What resulted was an entire body of literature hopelessly polluted by false positives.
This problem was likely heavily compounded by publication bias, with negative findings far less likely to be published. There is evidence that that problem exists in the neuroimaging literature too, especially for exploratory studies. If you are simply looking for some group difference in any of hundreds or thousands of possible imaging parameters, then finding one may be a (misplaced) cause for celebration, but not finding one is hardly worthy of writing up.
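To see why searching across many parameters almost guarantees a “finding”, consider the chance of at least one false positive when every test is run at the usual 0.05 threshold and no true effect exists anywhere. This is a minimal sketch that assumes the tests are independent, which real imaging parameters are not, so the numbers are a rough guide only. The function names are illustrative.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n_tests independent
    tests when no true effect exists (independence is an assumption)."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate near alpha."""
    return alpha / n_tests


for n_tests in (1, 10, 100, 1000):
    fwer = familywise_error_rate(n_tests)
    threshold = bonferroni_threshold(n_tests)
    print(f"{n_tests:>5} tests: P(any false positive)={fwer:.3f}, "
          f"Bonferroni threshold={threshold:.2e}")
```

Bonferroni correction is only one of several ways to control for multiple tests; it is used here because it is the simplest to write down.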
In genetics, the problems with the candidate gene approach were finally realised and fully grappled with. The solution was to perform unbiased tests for SNP associations across the whole genome (genome-wide association studies, or GWAS), to correct rigorously for the multiple tests involved, and to always include a separate replication sample prior to publication. Of course, enabling all that required something else: the formation of enormous consortia to generate the sample sizes needed to achieve the necessary statistical power (given how many tests were being performed and the small effect sizes expected).
This brings me back to the reaction on Twitter to the criticism of this particular paper. A number of people suggested that if neuroimaging studies were expected to have larger samples and to also include replication samples, then only very large labs would be able to afford to carry them out. What would the small labs do? How would they keep their graduate students busy and train them?
I have to say I have absolutely no sympathy for that argument at all, especially when it comes to allocating funding. We don’t have a right to be funded just so we can be busy. If a particular experiment requires a certain sample size to detect an effect size in the expected and reasonable range, then it should not be carried out without such a sample. And if it is an exploratory study, then it should have a replication sample built in from the start – it should not be left to the field to determine whether the finding is real or not.
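How large “a certain sample size” is can be estimated before any data are collected. The sketch below uses the standard normal approximation for comparing the means of two equal-sized groups. The effect sizes (standardised mean differences), the 80% power target and the function name are assumptions chosen for illustration; a real study would use a power analysis matched to its actual design.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate subjects needed per group for a two-sided, two-group
    comparison of means (normal approximation, equal group sizes)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Illustrative effect sizes (assumptions): large, medium, small.
for d in (0.8, 0.5, 0.2):
    print(f"effect size {d}: about {n_per_group(d)} per group")

# A stricter threshold, as needed when many tests are run, raises the
# requirement further.
print(n_per_group(0.5, alpha=0.05 / 1000))
```

The required sample grows with the inverse square of the effect size, which is why small expected effects push studies towards pooled, multi-centre samples.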
You might say, and indeed some people did say, that even if you can’t achieve those goals, because the lab is too small or does not have enough funding, at least doing it on a small scale is better than nothing.
Well, it’s not. It’s worse than nothing.
Such studies just pollute the literature with false positives – obscuring any real signal amongst a mass of surrounding flotsam that future researchers will have to wade through. Sure, they keep people busy, they allow graduate students to be trained (badly), and they generate papers, which often get cited (compounding the pollution). But they are not part of “normal science” – they do not contribute incrementally and cumulatively to a body of knowledge.
We are no further in understanding the neural basis of a condition like autism than we were before the hundreds of small-sample/exploratory-design studies published on the topic. They have not combined to give us any new insights, they don’t build on each other, they don’t constrain each other or allow subsequent research to ask deeper questions. They just sit there as “findings”, but not as facts.
Lest I be accused of being too preachy, I should confess to some of these practices myself. Several years ago, while candidate gene studies were still the norm, we published a paper that included a positive association of semaphorin genes with schizophrenia (prompted by relevant phenotypes in mutant mice). It seems quite likely now that that association was a false positive, as a signal from the gene in question has not emerged in larger genome-wide association studies.
And the one neuroimaging study I have done so far, on synaesthesia, certainly suffered from a small sample size (at the time it was considered decent) and had no replication sample. In our defence, our study was itself designed as a replication of previous findings, combining functional and structural neuroimaging. While our structural findings did mirror those previously reported (in general direction and spatial distribution of effects, though not precise regions), our functional results were quite incongruent with previous findings. As we did not have a replication sample built into our own design, I can’t be particularly confident that our findings will generalise – perhaps they were a chance finding in a fairly small sample. (Indeed, the imaging findings in synaesthesia have been generally quite inconsistent and it is difficult to know which findings constitute real results that future research studies could be built on.)
If I were designing these kinds of studies now I would use a very different design, with much larger samples and in-built replication (and pre-registration). If that means they are more expensive, so be it. If it means my group can’t do them alone, well that’s just going to be the way it is. No one should fund me, or any lab, to do under-powered studies.
For the neuroimaging field generally that may well mean embracing the idea of larger consortia and adopting common scanning formats that enable combining subjects across centres, or at least subsequent meta-analyses. And it will mean that smaller labs may have to give up on the idea of making a living from studies attempting to find differences between groups of people without enough subjects. You’ll find things – they just won’t be real.