On literature pollution and cottage-industry science
A few days ago there was a minor
Twitterstorm over a particular paper that claimed to have found an imaging
biomarker that was predictive of some aspect of outcome in adults with autism.
The details actually don’t matter that much and I don’t intend to pick on that
study in particular, or even link to it, as it’s no worse than many that get
published. What it prompted, though, was more interesting – a debate on
research practices in the field of cognitive neuroscience and neuroimaging,
particularly relating to the size of studies required to address some research
questions and the scale of research operation they might entail.
What kicked off the debate was a question
of how likely the result they found was to be “real”; i.e., to represent a
robust finding that would replicate across future studies and generalise to
other samples of autistic patients. I made a fairly uncompromising prediction
that it would not replicate, which was based on the fact that the finding
derived from: a small sample (n=31, in this case, but split into two), an
exploratory study (i.e., not aimed at or constrained by any specific
hypothesis, so that group differences in pretty much any imaging parameter
would do) and lack of a replication sample (to test directly, with exactly the
same methodology, whether the findings from the study were robust, prior to
bothering anyone else with them).
The reason for my cynicism is twofold. First, the study was statistically under-powered: low power not only makes genuine effects hard to detect, it also means that any positive result such a study does produce is more likely to be a false one. Second, and more damningly, there have
been literally hundreds of similar studies published using neuroimaging measures
to try and identify signatures that would distinguish between groups of people
or predict the outcome of illness. For psychiatric conditions like autism or
schizophrenia I don’t know of any such “findings” that have held up. We still
have no diagnostic or prognostic imaging markers, or any other biomarkers for
that matter, that have either yielded robust insights into underlying
pathogenic mechanisms or been applicable in the clinic.
There is thus strong empirical evidence that the small-sample, exploratory, no-replication design is a sure-fire way of generating findings that are, essentially, noise.
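To make that concrete, here is a minimal simulation of my own – the numbers are purely illustrative, not taken from the paper in question: two small groups, no true difference on any measure, and a couple of hundred imaging parameters to trawl through.

```python
# Sketch (illustrative, not from the post): an exploratory study comparing two
# small groups on many imaging parameters when there is, in truth, no group
# difference at all. How often does at least one "significant" hit appear, and
# how often does it survive a same-sized replication attempt?
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_per_group = 15       # roughly n=31 split into two
n_parameters = 200     # hypothetical number of imaging measures examined
n_studies = 2000       # simulated exploratory studies
alpha = 0.05

studies_with_hits = 0
replicated = 0

for _ in range(n_studies):
    # Null world: both groups drawn from the same distribution for every measure.
    a = rng.standard_normal((n_per_group, n_parameters))
    b = rng.standard_normal((n_per_group, n_parameters))
    _, p = stats.ttest_ind(a, b, axis=0)
    if (p < alpha).any():
        studies_with_hits += 1
        # "Replicate" the strongest hit in an independent, same-sized sample.
        # (Under the null, which measure it was makes no difference.)
        _, p_rep = stats.ttest_ind(rng.standard_normal(n_per_group),
                                   rng.standard_normal(n_per_group))
        if p_rep < alpha:
            replicated += 1

print(f"Studies reporting at least one 'finding': {studies_with_hits / n_studies:.0%}")
print(f"Top 'finding' replicating:                {replicated / max(studies_with_hits, 1):.0%}")
# With 200 uncorrected tests, essentially every study finds "something",
# and the top hit replicates at roughly the chance rate (~5%).
```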
This is by no means a problem only for
neuroimaging studies; the field of psychology is grappling with similar
problems and many key findings in cell biology have similarly failed to replicate. We have seen it before in genetics, too, during the “candidate gene
era”, when individual research groups could carry out a small-scale study
testing single-nucleotide polymorphisms in a particular gene for association
with a particular trait or disorder. The problems were that the samples were typically small and under-powered, that researchers often tested multiple SNPs, haplotypes or genotypes but rarely corrected for those multiple tests, and that they usually did
not include a replication sample. What resulted was an entire body of
literature hopelessly polluted by false positives.
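The correction that was so often skipped is not hard to apply; here is a quick sketch, using simulated, purely illustrative p-values, of what it looks like.

```python
# Sketch of the multiple-testing correction that candidate gene studies so
# often skipped. The p-values are simulated under the null for illustration.
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
p_values = rng.uniform(size=20)   # e.g. 20 SNP/haplotype/genotype tests, no true effect

raw_hits = (p_values < 0.05).sum()

# Bonferroni: control the chance of even one false positive across all tests.
reject_bonf, _, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg: control the expected proportion of false discoveries.
reject_fdr, _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print(f"Uncorrected 'hits': {raw_hits}")        # often at least one, just by chance
print(f"Bonferroni hits:    {reject_bonf.sum()}")
print(f"FDR (BH) hits:      {reject_fdr.sum()}")
```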
This problem was likely heavily compounded
by publication bias, with negative findings far less likely to be published. There is evidence that the same problem exists in the neuroimaging literature too,
especially for exploratory studies. If you are simply looking for some group
difference in any of hundreds or thousands of possible imaging parameters, then
finding one may be a (misplaced) cause for celebration, but not finding one is
hardly worthy of writing up.
In genetics, the problems with the
candidate gene approach were finally realised and fully grappled with. The
solution was to perform unbiased tests for SNP associations across the whole
genome (GWAS), to correct rigorously for the multiple tests involved, and to always
include a separate replication sample prior to publication. Of course, enabling all of that required something else: the formation of enormous consortia to
generate the sample sizes required to achieve the necessary statistical power
(given how many tests were being performed and the small effect sizes expected).
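To give a rough sense of the arithmetic (with an illustrative, not empirical, effect size), here is what moving to a genome-wide significance threshold does to the required sample.

```python
# Rough sketch of why GWAS forced the field into consortia: sample size per
# group needed for 80% power, for a small effect, at a nominal threshold versus
# the genome-wide significance threshold (~5e-8, roughly a Bonferroni
# correction for ~1 million independent tests). The effect size is illustrative.
from statsmodels.stats.power import NormalIndPower

analysis = NormalIndPower()
small_effect = 0.05   # standardised mean difference, of the order seen for single common variants

for alpha in (0.05, 5e-8):
    n = analysis.solve_power(effect_size=small_effect, alpha=alpha,
                             power=0.8, alternative="two-sided")
    print(f"alpha = {alpha:g}: ~{n:,.0f} subjects per group")
# The genome-wide threshold alone multiplies the required sample several-fold,
# before even considering a separate replication sample.
```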
This brings me back to the reaction on
Twitter to the criticism of this particular paper. A number of people suggested
that if neuroimaging studies were expected to have larger samples and to also
include replication samples, then only very large labs would be able to afford
to carry them out. What would the small labs do? How would they keep their
graduate students busy and train them?
I have to say I have absolutely no sympathy
for that argument at all, especially when it comes to allocating funding. We
don’t have a right to be funded just so we can be busy. If a particular
experiment requires a certain sample size to detect an effect size in the expected
and reasonable range, then it should not be carried out without such a sample.
And if it is an exploratory study, then it should have a replication sample
built in from the start – it should not be left to the field to determine
whether the finding is real or not.
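For illustration (the effect sizes here are my own assumptions, not taken from the paper under discussion), this is the kind of back-of-the-envelope power calculation that should come first.

```python
# A hedged sketch of the calculation that should precede such a study.
# With roughly 15 subjects per group (n=31 split in two), what group
# difference could the study plausibly detect, and how many subjects
# would a more realistic effect size require?
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Smallest standardised effect (Cohen's d) detectable with 80% power at n=15 per group.
detectable_d = analysis.solve_power(nobs1=15, alpha=0.05, power=0.8)
print(f"Detectable effect at n=15/group: d = {detectable_d:.2f}")   # roughly 1.1, i.e. a huge effect

# Sample size per group needed for a more typical between-group effect of d = 0.3.
n_needed = analysis.solve_power(effect_size=0.3, alpha=0.05, power=0.8)
print(f"Needed for d = 0.3: ~{n_needed:.0f} per group")              # roughly 175 per group
```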
You might say, and indeed some people did
say, that even if you can’t achieve those goals, because the lab is too small
or does not have enough funding, at least doing it on a small scale is better
than nothing.
Well, it’s not. It’s worse than nothing.
Such studies just pollute the literature
with false positives – obscuring any real signal amongst a mass of surrounding
flotsam that future researchers will have to wade through. Sure, they keep
people busy, they allow graduate students to be trained (badly), and they
generate papers, which often get cited (compounding the pollution). But they
are not part of “normal science” – they do not contribute incrementally and
cumulatively to a body of knowledge.
We are no further in understanding the
neural basis of a condition like autism than we were before the hundreds of
small-sample/exploratory-design studies published on the topic. They have not
combined to give us any new insights, they don’t build on each other, they
don’t constrain each other or allow subsequent research to ask deeper
questions. They just sit there as “findings”, but not as facts.
Lest I be accused of being too preachy, I should
confess to some of these practices myself. Several years ago, while candidate
gene studies were still the norm, we published a paper that included a positive
association of semaphorin genes with schizophrenia (prompted by relevant
phenotypes in mutant mice). It seems quite likely now that that association was
a false positive, as a signal from the gene in question has not emerged in
larger genome-wide association studies.
And the one neuroimaging study I have done
so far, on synaesthesia, certainly suffered from a small sample size (at the
time it was considered decent), and no replication sample. In our defence, the study was itself designed as a replication of previous findings, combining
functional and structural neuroimaging. While our structural findings did
mirror those previously reported (in general direction and spatial distribution
of effects, though not precise regions), our functional results were quite
incongruent with previous findings. As we did not have a replication sample
built into our own design, I can’t be particularly confident that our findings
will generalise – perhaps they were a chance finding in a fairly small sample.
(Indeed, the imaging findings in synaesthesia have been generally quite
inconsistent and it is difficult to know which findings constitute real results
that future research studies could be built on).
If I were designing these kinds of studies
now I would use a very different design, with much larger samples and in-built
replication (and pre-registration). If that means they are more expensive, so
be it. If it means my group can’t do them alone, well that’s just going to be
the way it is. No one should fund me, or any lab, to do under-powered studies.
For the neuroimaging field generally that
may well mean embracing the idea of larger consortia and adopting common
scanning formats that enable combining subjects across centres, or at least
subsequent meta-analyses. And it will mean that smaller labs may have to give
up on the idea of making a living from studies attempting to find differences
between groups of people without enough subjects. You’ll find things – they
just won’t be real.
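As a toy illustration of what pooling buys (the per-site numbers below are invented), here is a simple fixed-effect, inverse-variance meta-analysis of per-centre effects.

```python
# Minimal sketch (invented numbers) of pooling standardised group differences
# across centres with a fixed-effect, inverse-variance meta-analysis.
import numpy as np

# Hypothetical per-centre effect estimates (Cohen's d) and their standard errors.
d  = np.array([0.45, 0.10, 0.30, -0.05, 0.25])
se = np.array([0.35, 0.30, 0.40,  0.33, 0.28])

w = 1.0 / se**2                          # inverse-variance weights
d_pooled = np.sum(w * d) / np.sum(w)     # pooled estimate
se_pooled = np.sqrt(1.0 / np.sum(w))     # pooled standard error

print(f"Pooled d = {d_pooled:.2f} +/- {se_pooled:.2f}")
# No single site here has a precise estimate, but the pooled standard error is
# roughly half the best single-site one - which is the point of combining.
```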
Great post! I think there should be more international collaboration through consortia for these very reasons.
Even if the study was adequately powered and a marginally significant result was obtained, does it really tell you much about autism? The crisis is more about random design than statistics and poor theoretical models. No amount of p-enhancement can tell you what is worth looking for.
Is it really "worse than nothing"? That's a strong claim. I'd guess about 98% of neuroimaging studies since their inception in the late '70s involve samples smaller than 31. None of this has contributed cumulatively at all? Just noise.
It is important to note that power is not just a function of sample size. Many factors impact power including variability, measurement error, significance level, choice of statistical model, assumptions of statistical model, effect size, sample design, genetic architecture, etc. I could go on and on. Sample size is important but so are all of these other factors that are almost always ignored in both candidate gene studies and GWAS.
Great post!
Often our sample sizes are small because clinical populations are hard to recruit. What can we do to make the best of this situation?
You mentioned one: pre-registration. In the same vein, I guess the more specific a prediction the better. Similarly, the more reliable a measure, the better (so I should be pre-calibrating my experiment on a non-clinical sample to ensure I know the reliability of the measures I plan to use, and can add that into my power analysis). Anything else?
(yes, I take the point that underpowered studies shouldn't be funded, but doesn't that make ideas about improving underpowered studies even more relevant? I'm not going to get funded for this pilot study, but I still want to get the most evidential value out of it!)
Thanks Tom. Yes, there are several things that could be done to improve the situation. Definitely pre-calibrating your measures is a key way to tell what sample size you should use, to have power to detect an effect of a certain size against a background of whatever variance you observe.
In the studies themselves, bigger is better, generally speaking, unless of course there's a lot of hidden heterogeneity, in which case adding in more heterogeneous cases (e.g. of "autism") won't necessarily help! That's a whole other level of problem - for another blog I think!
But given that, all the more reason, if you find some effect in an exploratory analysis, to have a replication sample ready to test it on! Otherwise you simply won't know if it's a generalisable finding or not.
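To make that concrete (with purely illustrative numbers), here is roughly how measurement reliability feeds into the sample-size arithmetic: an unreliable measure attenuates the observable effect, which inflates the required sample.

```python
# Sketch (illustrative, not from the discussion above): how measurement
# reliability attenuates an effect size and inflates the required sample.
# Under a classical-test-theory assumption, an underlying standardised effect
# d_true is observed as roughly d_true * sqrt(reliability).
import numpy as np
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
d_true = 0.5  # hypothetical true group difference on a perfectly measured trait

for reliability in (1.0, 0.8, 0.5):
    d_observed = d_true * np.sqrt(reliability)
    n = analysis.solve_power(effect_size=d_observed, alpha=0.05, power=0.8)
    print(f"reliability = {reliability:.1f}: observable d = {d_observed:.2f}, "
          f"n per group for 80% power = {n:.0f}")
# Halving the reliability roughly doubles the required sample - which is why
# knowing the reliability of your measures beforehand matters for planning.
```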
good post
As an Autism-experienced parent I am delighted to see an academic pointing out the scientific flaws in the type of studies that proliferate in our field, generating so much empty-vessel amplifying noise in the media. When the argument of "giving small labs something to do" is raised - why are the academics deciding what needs to be studied? Why not listen to the community and engage with people who live and love with Autism? I'm not talking about huge organisations who have invested millions of donated funds in trying to find a "smoking gun" while contributing to an atmosphere of fear and hate. I'm talking about recruiting and surveying people themselves. This has to include self advocates and those who need family or others to advocate on their behalf. As a parent I am so fed up with post grads doing meaningless research which contributes zero to improving the lives of people with autism. It has no effect on policy other than to give a few outspoken politicians a chance to rush off a quick press release and try to get a sound bite while hitching a ride on the latest blame wagon.
Wow, great post.