On conceptual rigor: “What do we take ourselves to be doing?”
Dorothy’s efforts, along with those of many others in the Open Science “movement”, have helped to highlight crucial issues of methodological and statistical rigor that have led to irreproducibility across many fields, along with potential solutions and ways the scientific community can collectively address these problems. These efforts are aimed at improving the quality of what we are doing. But there is sometimes a deeper question to be asked: why are we doing what we are doing? What is the conceptual basis for the research we’re conducting? Of course, this is often very well laid out and supported, but not always. There is frequently a need not just for more methodological and statistical rigor, but for more conceptual rigor too.
Methodological and statistical problems
Problems of irreproducibility have arisen across fields, with social psychology garnering probably the most public attention. There is really nothing unique about this field, however – these problems are extremely widespread, as John Ioannidis pointed out as long ago as 2005. They arise from under-powered studies, small actual effect sizes, poorly defined hypotheses, exploratory analyses of a high number of variables, excessive degrees of freedom in analyses (including mining through covariates), lack of correction for multiple tests, and lack of internal replication – effectively fishing for significance in noisy data. These practices, compounded by publication bias, are guaranteed to flood the literature with false positives – spurious findings that do not replicate.
In the study of neurodevelopmental disorders, these problems have manifested especially in genetics and neuroimaging. The issues with candidate gene association analyses are, by now, well known. The premise here is that if some trait or condition is heritable, then there must be some underlying genetic variation associated with it. Some of that variation may be in the form of common genetic variants in the population. You may have a hypothesis about what causes a certain condition – say, derived from knowledge of the relevant pharmacology – and that may lead you to select a number of “candidate genes” for genetic analysis (such as the serotonin transporter gene for anxiety or depression, or dopamine receptor genes for schizophrenia).
The next step is to identify some common genetic variants in your set of candidate genes and perform an “association analysis” – this amounts to assessing the frequency of the different versions in a set of cases and controls and looking for significant differences. If you find a significant increase of a particular variant in cases, you can infer that it is associated with increased risk of the condition (in the same way you would for any epidemiological “exposure”).
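To make the logic of such an association analysis concrete, here is a minimal sketch of the single-variant case: comparing allele counts between cases and controls with a chi-square test and computing an odds ratio. The counts are made up purely for illustration, not drawn from any real study.

```python
# Hedged sketch: a single-variant case/control association test.
# All counts below are hypothetical, chosen only to illustrate the method.
from scipy.stats import chi2_contingency

# Rows: cases, controls; columns: counts of the risk allele vs the other allele.
table = [[620, 380],   # cases:    620 risk alleles, 380 other
         [540, 460]]   # controls: 540 risk alleles, 460 other

chi2, p, dof, expected = chi2_contingency(table)

# The odds ratio quantifies the strength of association, just as for
# any epidemiological "exposure".
odds_ratio = (table[0][0] * table[1][1]) / (table[0][1] * table[1][0])

print(f"chi2 = {chi2:.2f}, p = {p:.4f}, OR = {odds_ratio:.2f}")
```

Run over one variant, this is a perfectly reasonable test; the trouble described next arises when it is run over many variants at once without adjusting the significance threshold.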
The problem was that people searched in samples that were too small, analysed too many variants at the same time without correcting their thresholds for statistical significance for all those tests, sometimes added in covariates like sex or environmental exposures (increasing the number of tests multiplicatively), performed inferential statistics on exploratory data, typically didn’t include an independent replication sample, and almost exclusively published “positive” findings. The result? A decade’s worth of work that generated a literature of false positives, and a secondary literature aiming to work out the mechanisms underlying signals that were not real in the first place.
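The guarantee of false positives under these practices can be demonstrated with a small simulation. In the sketch below there is no real effect anywhere – cases and controls are drawn from the identical distribution – yet testing many variants in under-powered samples without correction reliably yields “significant” hits. The variant and sample counts are arbitrary choices for illustration.

```python
# Hedged sketch: "fishing for significance" under the null.
# No variant has any true effect, yet uncorrected testing produces hits.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n_variants = 200     # number of variants tested, all with zero true effect
n_per_group = 30     # a typically under-powered sample size per group

false_positives = 0
for _ in range(n_variants):
    cases = rng.normal(0, 1, n_per_group)      # same distribution in
    controls = rng.normal(0, 1, n_per_group)   # both groups (pure noise)
    _, p = ttest_ind(cases, controls)
    if p < 0.05:
        false_positives += 1

# Roughly 5% of tests come out "significant" by chance alone.
print(f"{false_positives} of {n_variants} null tests reached p < 0.05")
```

With publication bias layered on top, it is exactly these chance hits that get written up, while the unremarkable majority never see print.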
Thankfully, these problems were recognised by the human genetics community (including journals and funders), prompting a seismic shift in how genetics is carried out. The only solution was to cooperate, and very large consortia were formed to allow collection of samples from tens of thousands of people (rather than tens or hundreds). Technology was developed to assay common genetic variants across the entire genome, all in the same genome-wide association study (GWAS). Computational and statistical methods were also developed to ensure a high level of rigor: strict correction for multiple tests, with a genome-wide significance level of 5 × 10⁻⁸, control of confounds as far as possible (such as population stratification or batch effects from different sites), along with independent replication samples. Importantly, results for all variants are published, whether the individual association signals reach genome-wide significance or not.
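The genome-wide significance level of 5 × 10⁻⁸ is itself just multiple-testing arithmetic, which the sketch below works through. It assumes the standard convention of treating the genome as containing roughly one million independent common-variant tests and applying a Bonferroni correction to a family-wise alpha of 0.05.

```python
# Hedged sketch: deriving the conventional genome-wide significance threshold.
# The "one million independent tests" figure is the standard approximation,
# not an exact count.

family_wise_alpha = 0.05
independent_tests = 1_000_000  # approximate effective number of tests

# Bonferroni correction: divide the family-wise alpha by the number of tests.
genome_wide_threshold = family_wise_alpha / independent_tests
print(f"genome-wide threshold: {genome_wide_threshold:.0e}")

# For contrast: uncorrected testing of a million variants at p < 0.05
# would be expected to yield ~50,000 "hits" from noise alone.
expected_false_positives = independent_tests * family_wise_alpha
print(f"expected false positives, uncorrected: {expected_false_positives:.0f}")
```

Bonferroni is conservative, but in this setting that conservatism is the point: it is what keeps a million simultaneous tests from flooding the results with chance findings.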
The genetics community thus took the underlying problems seriously and acted to change the culture and practice of the whole field. The result has been a decade of discovery of thousands of common variants robustly associated with all kinds of traits and disorders. (Whether you think such discoveries are actionable is another question – more on that below).
Neuroimaging is currently facing up to the same problems. A recent study by Scott Marek and colleagues shows that so-called brain-wide association studies (or BWAS, where thousands of neuroimaging parameters are compared across groups of cases and controls, looking for some kind of difference somewhere) require the same kind of methodological rigor as GWAS. That means much bigger samples than are currently used – in the thousands, rather than the tens. Again, this will require a cultural change in how this research is carried out – the cottage industry approach just won’t produce reliable results. (Indeed, the literature claiming to have found all kinds of imaging “biomarkers” of all kinds of psychiatric conditions is as inconsistent and irreproducible as the candidate gene association literature).
Meanwhile, some other topics of relevance to neurodevelopmental disorders seem to have not yet gotten the memo on reproducibility. This includes a lot of the work on supposed transgenerational epigenetic effects of stress or trauma, for example, or claims of causal effects of various components of the gut microbiome. The methods used in these fields suffer from all the same problems outlined above, which are guaranteed to produce more noise than signal.
The general solutions are pretty clear: bigger samples, fewer degrees of freedom in the analyses, proper corrections for multiple testing, independent samples for replication of exploratory findings before publication, and publishing both positive and negative results. All of these can be supported by the use of Registered Reports, where a research design is submitted to a journal and peer-reviewed (with an opportunity to improve the design at that stage), and where, if the design is approved, the journal agrees to publish the paper describing it regardless of how the results turn out. This provides a much more robust way of doing open and reproducible science, one that aims to undercut the perverse incentives that can lead to systemic biases, especially publication bias.
The recognition of these problems and the changes that various fields are implementing to solve them are very welcome. They’re crucial, in fact, and should be a core part of the training of all scientists, as well as the upskilling of those of us who were trained before these issues came to light. But I wonder sometimes if they’re enough to ensure the science we do is not just reproducible but actually productive. What we should surely be aiming for is a truly progressive and collective deepening of our understanding of complex issues. To achieve that, we may need more conceptual clarity and, frankly, effort than is sometimes demonstrated.
Philosophers have a phrase they often employ, when talking or writing about their work – kind of a self-narration of the processes of thought. They will often say: “What I take myself to be doing here is…”. This might be something like: I take myself to be making a claim about the ontic rather than the epistemic status of something or other (i.e., what kind of a thing something is, rather than just what we know about it). I used to find this habit a bit pretentious, even precious, but the more I encounter it, the more valuable I think it is.
Really, it’s a discipline of reflection on the activity one is engaged in. It makes you make explicit the premises and assumptions on which a line of reasoning is based. The reason to lay this out there, for all to see, is that you might be wrong – not about your conclusions, but, more fundamentally, about what it is you think you’re doing. Philosophers make these declarations precisely so that others can challenge them, but it is, more fundamentally, a useful exercise in clarifying your own thoughts to yourself.
I’d like to see more scientists adopt this habit. It would be hugely useful to lay out the premises and assumptions underlying any given experiment or piece of research. Not just the proximal ones that have prompted some specific hypothesis, but the deeper ones that underpin the approach in general – the ones that often go unstated and unexamined.
For example, if we’re proposing to carry out a GWAS of some psychiatric condition, what is it we’re hoping to find? Presumably a list of associated genetic variants, but why? What use will they be? Will they tell us about the underlying “biology of the condition”? That phrase may have very different meanings, depending on the nature of the condition. “The biology” of cancer lies at the level of proteins controlling cellular differentiation, proliferation, cell cycle control, DNA repair, and so on. These are tightly linked to the functions of implicated genes. In contrast, “the biology” of something like autism or schizophrenia manifests at the level of the highest functions of the human mind – social cognition, conscious perception, language, organised thought. Finding the genes that convey some risk for the condition is highly unlikely to immediately inform on the biology underpinning those cognitive processes and psychological phenomena.
Instead, those GWAS have pointed in general at genes involved in neural development, reinforcing the view of the symptoms as emergent phenotypes, not directly linked to the functions of the encoded proteins. Given that is the case, we might ask what we take ourselves to be doing if we propose to carry out ever-larger GWAS of these conditions. I’m not arguing against it, just suggesting that it’s worth making explicit what is to be gained from such an exercise. The promised insights into “the biology” of the condition are unlikely to just pop out of such studies, nor will they directly produce a list of new molecular therapeutic targets, as often suggested.
More generally, we can ask: what kinds of things are our diagnostic categories? Do we take a category like autism or depression or bipolar disorder or schizophrenia as monolithic, representing a natural kind, or instead as a diagnosis of exclusion that may encompass a myriad of etiologies and pathologies? Our basic conceptions here are crucial in answering the question of what we take ourselves to be doing when, for example, we put a hundred people with autism in the MRI scanner and compare them to a hundred neurotypical people.
If, say, we’re looking for some structural brain differences between these groups (as has often been done), we should be able to explain why we might expect to find such a thing. The genetics has clearly shown us that “autism” is an umbrella term that describes an emergent cluster of symptoms at the psychological and behavioral levels, linked with extremely diverse genetic etiologies. Should we then expect some commonalities across patients at the level of brain structure, though they may have a hundred different genetic causes? Would we propose such an experiment for a category like “intellectual disability”?
Analyses of gene expression patterns or epigenetic marks in the (post mortem) brains of subjects with these conditions are similarly only vaguely justified, if at all. What is the underlying premise being tested? That all those diverse genetic etiologies might converge on a pattern of gene expression that underlies the observed symptoms? Should we expect the symptoms to have a direct molecular underpinning like that? Or do they instead reflect emergent activity regimes of the dynamical neural systems of the brain?
Alternatively, should we expect to see a direct signature of the primary genetic disruptions in the gene expression patterns of various parts of the adult brain? Attempts to link altered gene expression profiles to the genes directly implicated by GWAS seem to imply this idea, yet this premise is not made explicit or justified in the relevant publications. And it is not obvious, under that model, why so many distinct genetic etiologies would then lead to a consistent signature across patients.
You could do these kinds of projects with all the statistical and methodological rigor that could possibly be brought to bear and still not learn anything useful, if they are not founded on clear conceptual premises.
The same is true in many areas of cellular and animal neuroscience. Again in relation to neurodevelopmental disorders, if we make a mouse model of a mutation associated with high risk of autism in humans, is this a model “of autism”? Should we expect some behavioral similarities between the phenotypes observed in the mouse and in humans with autism? It’s probably better to think of the mouse as just a model of the effects of that particular mutation, but effects at what level? On biochemical pathways, developmental outcomes, function of neural circuits or systems, emergent behavioral phenotypes?
Again, how close a correspondence should we expect between what we see in a mutant mouse at any of these levels and what we observe in humans? Given that the effects of even high-risk mutations can vary hugely across individual humans due to differing genetic backgrounds and idiosyncratic developmental variation, what should we expect from a single, arbitrary (but inbred) genetic background in the mouse? None of this is intended to argue against doing this kind of work in cellular or animal models – it is merely a call for more conceptual clarity in laying out what can and can’t be gained from investigations at any given level.
To sum up, the focus on improving statistical and methodological rigor in these and in all fields is crucial if we want to make our science robust and reproducible. But, if we don’t take the time to ask ourselves what we take ourselves to be doing, we’re likely to end up doing something other than what we think. Poorly conceptualised experiments can waste just as much time and resources as poorly executed ones.