Telling good science from bad – a user’s guide to navigating the scientific literature

“Did you find it convincing?” That’s what one of my genetics professors used to ask us, a small group of undergraduates who blinked in response, like rabbits in the headlights. We didn’t know we were supposed to evaluate scientific papers. Who the hell were we to say if these papers – published by proper grown-up scientists in big-name scientific journals – were convincing or not? We thought published, peer-reviewed papers contained The Truth. But our professor (the always inspirational David McConnell at Trinity College Dublin) wasn’t about to let us off so easily. We learned quickly that it absolutely was our job, as fledgling scientists, to learn how to evaluate scientific papers.


This is not a responsibility that can be offloaded to peer reviewers and journal editors. It’s not that pre-publication peer review, when it works as intended, doesn’t perform a useful function. But those supposed gatekeepers of knowledge are as fallible as the primary producers of the research itself, as prone to hype and fads and influence, as steeped in the assumptions of their field, as invested in the shared paradigms and methods, and every bit as wrapped up in the sociological enterprise that is the modern reality of science.


Passing pre-publication peer review is, regrettably, no guarantee of quality or scientific rigour. Nor is publication in high-impact journals, which tend to prize novelty in both methods and results over incremental advances, which, by definition, are more likely to be robust. Individual readers of the scientific literature – researchers, clinicians, policy-makers, journalists, and especially students – should be empowered to critically evaluate scientific publications.


This is especially important now as preprints become more widely adopted, most recently in biology. The rise of preprint servers such as bioRxiv and medRxiv is a very welcome advance, busting the stranglehold of journals on access to scientific research. But it will require a change in how readers and members of the field engage with published research (and yes, posting a preprint is making the research public).


The internet has democratised scientific publishing and it can and should democratise the peer review process too. The hive-mind can take the place of a few selected reviewers and editors in judging what work is robust and reliable. Of course we're not all technical experts in every area, but there are some general things to look out for that can help distinguish solid work from shakier offerings. Others will have their own lists, but here are what I take to be the markers of quality in scientific papers, or, conversely, red flags:


The good:


1. Based on a solid foundation of prior work

2. Clearly articulated hypothesis, with an experimental design appropriate to test it

3. Alternatively, presentation of the work as descriptive or exploratory

4. Right question asked at right level with right tools

5. Experimental controls

6. Sufficient statistical power

7. Appropriate statistical analyses

8. Converging lines of evidence

9. Internal replication

10. Conclusions supported by the data


All of which is to say that the papers I find most convincing build on previous work to generate a testable hypothesis or identify something worth exploring, design an experiment or an analysis at the right level to test it, perform the appropriate controls (whether experimental or statistical), don’t rely on a single method or results from a single experiment or analysis but draw on multiple, preferably diverse lines of evidence (which may each have their own limitations or confounds or biases, but at least they should be different ones), replicate their findings (especially effects or associations that showed up as ‘significant’ but that were not hypothesised a priori), and limit their conclusions to what is actually supported by the data.


In short, I like to get a sense that the authors went to great lengths to convince themselves of their findings and have been cautious and thoughtful in their interpretation. Conversely, these are the characteristics that make me say “nope” (sometimes out loud) and move on to the next paper:


The bad:


1. No clear rationale or hypothesis

2. Not founded on solid body of work

3. Wrong level to address the question

4. Reliance on single method or analysis

5. Statistically underpowered

6. Data-dredging: just looking through many measures for some statistically significant 'finding' – any effect or association in any direction

7. Covariate mining, multiplying likelihood of spurious 'findings'

8. Failure to correct for multiple tests

9. No replication

10. High likelihood of publication bias
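Points 6 to 8 compound each other: look across enough measures and a 'significant' result is almost guaranteed, even when there is nothing there. A toy simulation makes this concrete (the specifics here – twenty independent measures, a 0.05 threshold – are illustrative assumptions of mine, not drawn from any particular study):

```python
import random

def chance_of_any_hit(n_measures, alpha=0.05, n_sims=5000, seed=42):
    """Monte Carlo estimate of how often a study with NO real effects
    still turns up at least one p-value below alpha, when n_measures
    independent comparisons are run. Under the null hypothesis each
    p-value is uniformly distributed on [0, 1], so we can simulate one
    by drawing a uniform random number."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(n_sims)
        if any(rng.random() < alpha for _ in range(n_measures))
    )
    return hits / n_sims

print(f"1 measure:  {chance_of_any_hit(1):.2f}")    # roughly 0.05, as advertised
print(f"20 measures: {chance_of_any_hit(20):.2f}")  # roughly 0.64 = 1 - 0.95**20
# Bonferroni correction (test each at alpha / n) restores the family-wise rate:
print(f"20 measures, corrected: {chance_of_any_hit(20, alpha=0.05 / 20):.2f}")
```

This is why a lone uncorrected p < 0.05, fished out of dozens of measures, tells you very little: with twenty null comparisons there is roughly a two-in-three chance of at least one spurious hit.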


These are papers where it just doesn’t seem like the authors have a clear idea of what they’re asking – where the hypotheses are vague or not well justified – and where the experimental design, though it may generate data (sometimes lots and lots of data), is not well suited to actually answering any question, at least not at the level that the authors are interested in (e.g., looking to transcriptomics data to answer a systems neuroscience question, or MRI data to address a psychological question).


There may also be lots of technical experiments that are just poorly done (because many of them are frankly really hard), but it's often difficult for non-experts to evaluate these. This is where reviewers who are real peers in the same subfield are invaluable. Dodgy stats are much more general, however, and the biggest red flag is a sample size that is simply too small to reliably detect the kind of effect the authors are interested in or are claiming is real. (The idea that an 'effect' detected with only a small, underpowered sample is more likely to be real is a fallacy – such findings are far more likely to be false positives.) And there are other well-known 'questionable research practices' and biases that any reader can learn to be on the lookout for (see references below). Pre-registration – now offered by many journals – is a good protection against these dangers and a positive mark of reliability.
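The false-positive fallacy can also be put in numbers. The positive predictive value – the fraction of 'significant' findings that reflect real effects – depends on statistical power, the significance threshold, and the prior odds that a tested hypothesis is true (this is the core argument of the Ioannidis paper in the reading list below). A rough sketch, with illustrative numbers of my own choosing:

```python
def ppv(power, alpha=0.05, prior=0.1):
    """Positive predictive value: the fraction of statistically
    significant results that reflect a true effect. Assumes a fraction
    `prior` of tested hypotheses are actually true; the specific
    numbers here are illustrative, not estimates for any real field."""
    true_positives = power * prior         # real effects, correctly detected
    false_positives = alpha * (1 - prior)  # null effects crossing the threshold
    return true_positives / (true_positives + false_positives)

# A well-powered study: roughly two thirds of significant findings are real.
print(f"Power 0.8: PPV = {ppv(0.8):.2f}")  # 0.64

# An underpowered study: most 'significant' findings are false positives.
print(f"Power 0.2: PPV = {ppv(0.2):.2f}")  # 0.31
```

Same threshold, same prior; only the power has changed, and with it the credibility of any individual 'significant' result.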

Here are a few examples that I have examined before: The Trouble with Epigenetics, Part 2; The Trouble with Epigenetics, Part 3 – over-fitting the noise; Grandma’s trauma – a critical appraisal of the evidence for transgenerational epigenetic inheritance in humans. They happen to be drawn from the field of transgenerational epigenetics, but the issues are very general and recognizable and discussed further here: Calibrating scientific skepticism; and here: On literature pollution and cottage-industry science.

Beyond these issues of experimental design, technical execution, and proper statistics, there are some sociological factors that tend to make me more skeptical than I would be otherwise when reading some papers:


The dodgy:


1. Part of currently sexy, trendy field

2. Appeals to other dodgy papers as support for general paradigm

3. Hype, spin, excessive claims of novelty

4. Over-interpretation or extrapolation beyond what the data support

5. Claiming to overturn mainstream thought

6. Someone selling something (or the potential for them to)


You would think that extraordinary claims would require extraordinary evidence. The reality in a lot of scientific publishing is that such claims can get published with only suggestive evidence because, if they turn out to be right, the paper will get hugely cited. (And even if they eventually fizzle or are shown not to replicate, the paper may still attract lots of attention and citations for some time). And when peer reviewers in the same field have a shared interest in the general paradigm being a thing (e.g., transgenerational epigenetics or social priming or the gut microbiome affecting our psychology), then every additional paper that gets published can be used as support for their own future publications (or grant applications).


As for selling something, sometimes this is quite overt and declared as a possible conflict of interest. But other times the work being presented is part of a long process of development of some reagent or approach that the authors hope may have clinical or commercial value in the future. There’s nothing wrong with that, in principle – developing new therapeutics is what pre-clinical work is all about, after all. And potential profit is one of the incentives that drives that work. But this incentive can lead to an exaggeration of claims of efficacy, over-extrapolation beyond the experimental system used, selective focus on supposedly supporting lines of evidence, and so on. It may be overly cynical, but in these scenarios, I set the gain on my skepticometer a little higher.


The not even wrong:


Finally, there are some papers where the experimental work and statistical analyses are all fine, but that suffer from much deeper problems in the underlying paradigm. These are ones where the conceptual foundation is vague, with unexamined and unjustified assumptions, poorly framed questions, fundamental category errors, or other theoretical or philosophical issues that mean the authors are not investigating what they think they’re investigating.


A common example of interest to me is papers that say such-and-such (a pattern of functional connectivity between some brain areas, or a profile of post mortem gene expression, or some other supposed biomarker) is the case “in autism” or “in schizophrenia” or “in depression”. These implicitly treat these psychiatric diagnostic categories as natural kinds, when we know that these labels are diagnoses of exclusion for conditions that are actually incredibly heterogeneous, both clinically and etiologically. My heuristic in evaluating these papers is to replace the offending phrase with “in intellectual disability” (which people don’t make the same error with) and see if the approach makes any sense at all.

Other examples include expecting complex behaviours (often with crucial societal and cultural factors at play) to be reducible to: differences in the size of certain brain areas (e.g., Are bigger bits of brains better?; The murderous brain – can neuroimaging really distinguish murderers?; Debunking the male-female brain mosaic); levels of expression of certain genes (Epigenetics: what impact does it have on our psychology?; If genomics is the answer, what's the question? A commentary on PsychENCODE); or polygenic scores (Is your future income written in your DNA?).

These are examples from my own fields of research – I’m sure readers will have their own bugbears that exercise them as much as these ones do me. (A common one in psychology is assuming that effects seen in brief experiments in highly artificial lab settings will somehow translate to actual behavior in the real world).

Then there are much deeper philosophical issues affecting whole fields, for example whether a reductionist approach in biology is the right one, whether mechanistic concepts of cells or computational concepts of brains are appropriate, whether the multifarious factors contributing to complex processes can in any sense be disentangled, differences between predicting, controlling, explaining, and understanding, and so on.

Anyway, that’s my personal user’s guide to navigating the scientific literature. It may seem cynical, maybe even arrogant, to make these kinds of judgments of other people’s work (though I should confess to having made, and I hope learned from, some of the mistakes listed above myself, both as a researcher and a reviewer). But life is short and the literature is vast and growing at a relentless pace – considering the issues listed above lets me winnow what’s worth reading in detail to a more manageable pile. Beyond that, if we want the general public and our policy-makers to “trust the science” in making decisions on issues of global impact, then we have a collective responsibility, in my view, to help curate the scientific literature.

Useful reading:


Ioannidis (2005). Why Most Published Research Findings Are False.

Button et al. (2013). Power failure: why small sample size undermines the reliability of neuroscience.

Stuart Ritchie (2021). Science Fictions – Exposing Fraud, Bias, Negligence, and Hype in Science. Penguin Books. (A super read and one I recommend for all undergrads starting out in biology, psychology, and related fields.)

Dorothy Bishop. Oxford Reproducibility Lectures.

Dorothy Bishop (2019). Rein in the four horsemen of irreproducibility.

Björn Brembs (2019). Reliable novelty: New should not trump true.

Kevin Mitchell (2017). Neurogenomics – towards a more rigorous science.

Munafò et al. (2017). A manifesto for reproducible science.

Tal Yarkoni (2020). The generalizability crisis.

Makin et al. (2018). Ten common statistical mistakes to watch out for when writing or reviewing a manuscript.







