On literature pollution and cottage-industry science

By Kevin Mitchell - December 03, 2015

A few days ago there was a minor Twitterstorm over a particular paper that claimed to have found an imaging biomarker that was predictive of some aspect of outcome in adults with autism. The details actually don’t matter that much and I don’t intend to pick on that study in particular, or even link to it, as it’s no worse than many that get published. What it prompted, though, was more interesting – a debate on research practices in the field of cognitive neuroscience and neuroimaging, particularly relating to the size of studies required to address some research questions and the scale of research operation they might entail.

What kicked off the debate was a question of how likely the result they found was to be “real”; i.e., to represent a robust finding that would replicate across future studies and generalise to other samples of autistic patients. I made a fairly uncompromising prediction that it would not replicate, which was based on the fact that the finding derived from: a small sample (n=31, in this case, but split into two), an exploratory study (i.e., not aimed at or constrained by any specific hypothesis, so that group differences in pretty much any imaging parameter would do) and lack of a replication sample (to test directly, with exactly the same methodology, whether the findings from the study were robust, prior to bothering anyone else with them).

The reason for my cynicism is twofold: first, the study was statistically under-powered, and a positive result from such a study is theoretically more likely to be a false positive. Second, and more damningly, there have been literally hundreds of similar studies published using neuroimaging measures to try to identify signatures that would distinguish between groups of people or predict the outcome of illness. For psychiatric conditions like autism or schizophrenia I don’t know of any such “findings” that have held up. We still have no diagnostic or prognostic imaging markers, or any other biomarkers for that matter, that have either yielded robust insights into underlying pathogenic mechanisms or been applicable in the clinic.

There is thus strong empirical evidence that the small sample, exploratory, no replication design is a sure-fire way of generating findings that are, essentially, noise.
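The theoretical half of that argument can be made concrete with a little arithmetic. What follows is only a sketch: the function name and all of the input values (a 10% prior that a tested hypothesis is true, a 0.05 significance threshold, power of 0.8 versus 0.2) are assumptions chosen for illustration, not figures from the study in question. It computes the share of "significant" results that reflect a real effect.

```python
def positive_predictive_value(power: float, alpha: float, prior: float) -> float:
    """Fraction of significant results that correspond to a true effect.

    power: probability of detecting an effect that really exists
    alpha: probability of a significant result when no effect exists
    prior: assumed fraction of tested hypotheses that are actually true
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


# Hypothetical inputs: 10% of tested hypotheses are true, alpha = 0.05.
for power in (0.8, 0.2):
    ppv = positive_predictive_value(power=power, alpha=0.05, prior=0.1)
    print(f"power={power:.1f}: share of positives that are real = {ppv:.2f}")
```

Under these assumed inputs, dropping power from 0.8 to 0.2 takes the share of real findings among the positives from roughly two thirds to under one third, before any multiple testing or publication bias is taken into account.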

This is by no means a problem only for neuroimaging studies; the field of psychology is grappling with similar problems and many key findings in cell biology have similarly failed to replicate. We have seen it before in genetics, too, during the “candidate gene era”, when individual research groups could carry out a small-scale study testing single-nucleotide polymorphisms in a particular gene for association with a particular trait or disorder. The problem was the samples were typically small and under-powered, the researchers often tested multiple SNPs, haplotypes or genotypes but rarely corrected for such multiple tests, and they usually did not include a replication sample. What resulted was an entire body of literature hopelessly polluted by false positives.

This problem was likely heavily compounded by publication bias, with negative findings far less likely to be published. There is evidence that that problem exists in the neuroimaging literature too, especially for exploratory studies. If you are simply looking for some group difference in any of hundreds or thousands of possible imaging parameters, then finding one may be a (misplaced) cause for celebration, but not finding one is hardly worthy of writing up.
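To put a rough number on the multiple-testing problem: if nothing real is going on and each test has a 5% chance of coming up significant, the chance of at least one "hit" climbs quickly with the number of tests. The sketch below assumes the tests are independent, which imaging parameters and neighbouring SNPs generally are not, so treat it as an illustration of the trend and not as a model of any actual study. The function name is made up for this example.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n_tests tests.

    Assumes every null hypothesis is true and the tests are independent.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


# Illustrative numbers of uncorrected tests.
for n_tests in (1, 10, 100, 1000):
    rate = familywise_error_rate(n_tests)
    print(f"{n_tests:>4} tests: P(at least one false positive) = {rate:.3f}")
```

With a hundred or more uncorrected tests, finding "something" is close to guaranteed even when there is nothing there to find.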

In genetics, the problems with the candidate gene approach were finally realised and fully grappled with. The solution was to perform unbiased tests for SNP associations across the whole genome (GWAS), to correct rigorously for the multiple tests involved, and to always include a separate replication sample prior to publication. Of course, enabling all that required something else: the formation of enormous consortia to generate the sample sizes required to achieve the necessary statistical power (given how many tests were being performed and the small effect sizes expected).
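A back-of-the-envelope sketch shows why the samples had to get so large. It is not how GWAS power is actually calculated: it assumes a simple two-group comparison of means, a normal approximation, a Bonferroni correction, and a standardised effect size (Cohen's d), and the function name and example values are invented for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size: float, n_tests: int = 1,
                alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate subjects needed per group for a two-sided comparison.

    Uses the normal approximation n = 2 * ((z_alpha + z_power) / d) ** 2,
    with alpha divided by n_tests (Bonferroni correction).
    """
    corrected_alpha = alpha / n_tests
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1.0 - corrected_alpha / 2.0)
    z_power = standard_normal.inv_cdf(power)
    return ceil(2.0 * ((z_alpha + z_power) / effect_size) ** 2)


# Illustrative combinations of effect size and number of tests.
for effect_size in (0.5, 0.2, 0.1):
    for n_tests in (1, 1_000, 1_000_000):
        n = n_per_group(effect_size, n_tests)
        print(f"d={effect_size}, tests={n_tests:>9,}: about {n:,} per group")
```

Both levers push in the same direction: halving the expected effect size roughly quadruples the required sample, and correcting for many tests raises it further still.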

This brings me back to the reaction on Twitter to the criticism of this particular paper. A number of people suggested that if neuroimaging studies were expected to have larger samples and to also include replication samples, then only very large labs would be able to afford to carry them out. What would the small labs do? How would they keep their graduate students busy and train them?

I have to say I have absolutely no sympathy for that argument at all, especially when it comes to allocating funding. We don’t have a right to be funded just so we can be busy. If a particular experiment requires a certain sample size to detect an effect size in the expected and reasonable range, then it should not be carried out without such a sample. And if it is an exploratory study, then it should have a replication sample built in from the start – it should not be left to the field to determine whether the finding is real or not.

You might say, and indeed some people did say, that even if you can’t achieve those goals, because the lab is too small or does not have enough funding, at least doing it on a small scale is better than nothing.

Well, it’s not. It’s worse than nothing.

Such studies just pollute the literature with false positives – obscuring any real signal amongst a mass of surrounding flotsam that future researchers will have to wade through. Sure, they keep people busy, they allow graduate students to be trained (badly), and they generate papers, which often get cited (compounding the pollution). But they are not part of “normal science” – they do not contribute incrementally and cumulatively to a body of knowledge.

We are no further in understanding the neural basis of a condition like autism than we were before the hundreds of small-sample/exploratory-design studies published on the topic. They have not combined to give us any new insights, they don’t build on each other, they don’t constrain each other or allow subsequent research to ask deeper questions. They just sit there as “findings”, but not as facts.

Lest I be accused of being too preachy, I should confess to some of these practices myself. Several years ago, while candidate gene studies were still the norm, we published a paper that included a positive association of semaphorin genes with schizophrenia (prompted by relevant phenotypes in mutant mice). It seems quite likely now that that association was a false positive, as a signal from the gene in question has not emerged in larger genome-wide association studies.

And the one neuroimaging study I have done so far, on synaesthesia, certainly suffered from a small sample size (at the time it was considered decent), and no replication sample. In our defense, our study was itself designed as a replication of previous findings, combining functional and structural neuroimaging. While our structural findings did mirror those previously reported (in general direction and spatial distribution of effects, though not precise regions), our functional results were quite incongruent with previous findings. As we did not have a replication sample built into our own design, I can’t be particularly confident that our findings will generalise – perhaps they were a chance finding in a fairly small sample. (Indeed, the imaging findings in synaesthesia have been generally quite inconsistent and it is difficult to know which findings constitute real results that future research studies could be built on.)

If I were designing these kinds of studies now I would use a very different design, with much larger samples and in-built replication (and pre-registration). If that means they are more expensive, so be it. If it means my group can’t do them alone, well that’s just going to be the way it is. No one should fund me, or any lab, to do under-powered studies.

For the neuroimaging field generally that may well mean embracing the idea of larger consortia and adopting common scanning formats that enable combining subjects across centres, or at least subsequent meta-analyses. And it will mean that smaller labs may have to give up on the idea of making a living from studies attempting to find differences between groups of people without enough subjects. You’ll find things – they just won’t be real.

Comments

Unknown, December 4, 2015 at 8:22 AM
Great post! I think there should be more international collaboration through consortia for these very reasons.
Unknown, December 4, 2015 at 3:47 PM
Even if the study was adequately powered and a marginally significant result was obtained, does it really tell you much about autism? The crisis is more about random design than statistics and poor theoretical models. No amount of p-enhancement can tell you what is worth looking for.
Brad Buchsbaum, December 5, 2015 at 10:08 AM
Is it really "worse than nothing"? That's a strong claim. I'd guess about 98% of neuroimaging studies since their inception in the late 70s involve samples smaller than 31. None of this has contributed cumulatively at all? Just noise?

Jason H. Moore, Ph.D., December 9, 2015 at 4:34 AM
It is important to note that power is not just a function of sample size. Many factors impact power including variability, measurement error, significance level, choice of statistical model, assumptions of statistical model, effect size, sample design, genetic architecture, etc. I could go on and on. Sample size is important but so are all of these other factors that are almost always ignored in both candidate gene studies and GWAS.
Tom Stafford, February 24, 2016 at 11:11 AM
Great post!

Often our sample sizes are small because clinical populations are hard to recruit. What can we do to make the best of this situation?

You mentioned one: pre-registration. In the same vein, I guess the more specific a prediction the better. Similarly, the more reliable a measure, the better (so I should be pre-calibrating my experiment on a non-clinical sample to ensure I know the reliability of the measures I plan to use, and can add that into my power analysis). Anything else?

(yes, I take the point that underpowered studies shouldn't be funded, but doesn't that make ideas about improving underpowered studies even more relevant? I'm not going to get funded for this pilot study, but I still want to get the most evidential value out of it!)
zain, March 10, 2016 at 3:32 AM
good post
Lisamaree, April 4, 2016 at 11:22 AM
As an Autism experienced parent I am delighted to see an academic pointing out the scientific flaws in the type of studies that proliferate in our field, generating so much empty vessel amplifying noise in the media. When the argument of "giving small labs something to do" is raised, why are the academics deciding what needs to be studied? Why not listen to the community and engage with people who live and love with Autism? I'm not talking about huge organisations who have invested millions of donated funds in trying to find a "smoking gun" while contributing to an atmosphere of fear and hate. I'm talking about recruiting and surveying people themselves. This has to include self advocates and those who need family or others to advocate on their behalf. As a parent I am so fed up with post grads doing meaningless research which contributes zero to improving the lives of people with autism. It has no effect on policy other than to give a few outspoken politicians a chance to rush off a quick press release and try to get a sound bite while hitching a ride on the latest blame wagon.
David Miller, May 13, 2018 at 9:05 AM
Wow, great post.