What do GWAS signals mean?
Genome-wide association studies (GWAS) have been highly successful at linking genetic variation in hundreds of genes to an ever-growing number of traits or diseases. The fact that the genes implicated fit with the known biology for many of these traits or disorders strongly suggests (effectively proves, really) that the findings from GWAS are “real” – they reflect some real biological involvement of those genes in those diseases. (For example, GWAS have implicated skeletal genes in height, immune genes in immune disorders, and neurodevelopmental genes in schizophrenia).
But figuring out the nature of that involvement and the underlying biological mechanisms is much more challenging. In particular, it is not at all straightforward to understand how statistical measures derived at the level of populations relate to effects in individuals. Here, I explore some of the diverse mechanisms in individuals that may underlie GWAS signals.
GWAS take an epidemiological approach to identify genetic variants associated with risk of disease in exactly the same way epidemiologists identify environmental factors associated with risk – they look for factors that are more frequent in cases with a disease than in unaffected controls. For example, smoking is more common in people with lung cancer than in people without lung cancer (even though only a minority of people who smoke get lung cancer). From this we can deduce that smoking may be a risk-modifying factor for lung cancer, and we can measure the strength of that effect. Of course, observational epidemiology cannot prove causation – but it can provide important clues as to the risk architecture of a disease.
For GWAS, the factors in question are not environmental – they are the differences in our DNA that exist at millions of positions across the genome. These “single-nucleotide polymorphisms”, or SNPs, are positions in the genome where the DNA sequence varies between people – sometimes it might be an “A”, sometimes it might be a “T” (or a “G” or a “C”). Of course, any position in the genome can be mutated and likely is mutated in someone on the planet, but such mutations are typically extremely rare. SNPs are different – they are positions where two different versions are both relatively frequent in the population; these versions are thus often referred to as common variants.
GWAS are premised on the simple idea that if any of those common variants at any of those millions of SNPs across the genome is associated with an increased risk of disease, then that variant should be more frequent in cases than in controls. So, if we find variants that are more common in cases than in controls, we can infer that these variants may be causally related to an increased risk of disease.
What that doesn’t tell us is how. How does having one variant over another at that particular site cause an increased risk of that particular disease? I don’t just mean by what biological mechanism; I mean how does risk calculated at the population level relate to effects in individuals?
Statistically, we get two measures out of GWAS for any SNP that is associated. One is the p-value, which is a measure of how unlikely it would be to see a frequency difference of the magnitude we observe, just by chance. You might, for example, find that the “A” version at one SNP is at 25% frequency in controls but 28% frequency in cases. That’s not a big difference, so you’d need a very big sample to make sure it wasn’t noise, which is precisely why GWAS now use sample sizes of tens or even hundreds of thousands of people.
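To get a feel for why sample size matters here, the following minimal sketch runs a simple pooled two-proportion z-test on the 25% versus 28% example. The function name, the sample sizes and the choice of test are all assumptions made for illustration; real GWAS analyses typically use regression models with covariates rather than this bare-bones test.

```python
import math


def two_proportion_p_value(freq_cases, freq_controls, n_case_alleles, n_control_alleles):
    """Two-sided p-value for a difference in allele frequency (pooled z-test)."""
    # Pooled allele frequency under the null hypothesis of no difference.
    pooled = (freq_cases * n_case_alleles + freq_controls * n_control_alleles) / (
        n_case_alleles + n_control_alleles
    )
    standard_error = math.sqrt(
        pooled * (1 - pooled) * (1 / n_case_alleles + 1 / n_control_alleles)
    )
    z = (freq_cases - freq_controls) / standard_error
    # Two-sided tail probability of a standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical sample sizes: 500 vs 50,000 people per group (two alleles each).
p_small = two_proportion_p_value(0.28, 0.25, 1_000, 1_000)
p_large = two_proportion_p_value(0.28, 0.25, 100_000, 100_000)
print(f"1,000 alleles per group:   p = {p_small:.3g}")
print(f"100,000 alleles per group: p = {p_large:.3g}")
```

With these made-up sample sizes, the same three-point difference in frequency cannot be told apart from noise in the small sample but is overwhelmingly significant in the large one.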
GWAS also apply very rigorous thresholds for statistical significance, in order to correct for the fact that they are testing so many different SNPs. (This follows the logic that, while it is quite unlikely that you will win the lottery yourself, if enough tickets are sold, it won’t be surprising if the lottery is won by somebody). These methods have greatly advanced the trustworthiness of results from the field, far beyond those reported in the benighted “candidate gene era”. But the p-value doesn’t tell us anything about how big of an effect there is – how much of an effect on risk does the difference in frequency between cases and controls reflect?
That number is summarised by the other measure we get for each associated SNP, which is the odds ratio. This reflects the size of the difference in frequency of that variant between cases and controls. It is calculated very simply: say your SNP comes in two versions, or “alleles”: “A” and “G”. We want to convert the difference in absolute frequencies in cases versus controls (say 28% vs 25%, or 62% vs 60%, or whatever it is) into a number that tells us how many times more common is one version in cases versus controls. (The reason is that that number is more easily related to the increased risk associated with having that version).
Here’s an example: If we take 28% and 25% as frequencies of the “A” allele at a certain SNP in cases and controls, respectively, then if you were to select an “A” allele at random from the sample, the odds of it coming from a case versus a control is 0.28/0.25 (=1.12). The odds of the alternative “G” allele occurring in a case versus a control is correspondingly lower: 0.72/0.75 (=0.96). The odds ratio is then 1.12/0.96 = 1.167. Assuming that the cases and controls are representative of the general population, we can infer that individuals with an “A” allele are 1.167 times more likely to be a case, compared to those with the “G” allele, which is the number we’re after. (Note that this approximation of odds ratio to relative risk only holds when the disease is rare).
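The caveat about rare diseases can be checked with a few lines of arithmetic. This is a minimal sketch, assuming we know the risk in the reference group (a hypothetical input, called `baseline_risk` here); it converts an odds ratio into the relative risk it implies.

```python
def relative_risk(odds_ratio, baseline_risk):
    """Relative risk implied by an odds ratio, given the risk in the reference group."""
    baseline_odds = baseline_risk / (1 - baseline_risk)
    exposed_odds = odds_ratio * baseline_odds
    exposed_risk = exposed_odds / (1 + exposed_odds)
    return exposed_risk / baseline_risk


# Hypothetical baseline risks: a rare disease (1%) and a common one (30%).
for baseline_risk in (0.01, 0.30):
    rr = relative_risk(1.167, baseline_risk)
    print(f"baseline risk {baseline_risk:.0%}: relative risk = {rr:.3f}")
```

For the rare disease the relative risk stays very close to the odds ratio; for the common one it falls noticeably below it.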
If you do the same calculations for 62% vs 60% it works out to 1.09. These odds ratios are on the order of the typical values obtained from GWAS. For comparison, the odds ratio for smoking and lung cancer is around 20. It is calculated in the same way, e.g., from data like these from a study in Spain in the 1980s (where smoking was apparently astronomically common!): this study found that 98.8% of lung cancer patients were smokers, while “only” 80.3% of controls were smokers. Doing the same calculations as above, (0.988/0.012)/(0.803/0.197), gives an OR = 20.2, which is consistent with many other studies.
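The calculation is easy to reproduce. The sketch below uses a helper of my own naming, `odds_ratio`, which takes the odds of the factor within cases and divides by the odds within controls; this is algebraically the same as the allele-by-allele route described above.

```python
def odds_ratio(freq_cases, freq_controls):
    """Odds ratio from the frequency of a factor in cases and in controls."""
    odds_in_cases = freq_cases / (1 - freq_cases)
    odds_in_controls = freq_controls / (1 - freq_controls)
    return odds_in_cases / odds_in_controls


examples = {
    "SNP, 28% vs 25%": (0.28, 0.25),
    "SNP, 62% vs 60%": (0.62, 0.60),
    "smoking, 98.8% vs 80.3%": (0.988, 0.803),
}
for label, (cases, controls) in examples.items():
    print(f"{label}: OR = {odds_ratio(cases, controls):.3f}")
```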
Thus, for either genetic or environmental factors, the odds ratio gives an average increased risk of disease. But, biologically, what is actually going on in each individual that collectively gives that signal?
The most straightforward interpretation is that an odds ratio of, say, 1.2 at the population level reflects exactly the same thing at the individual level – each individual who inherits that SNP variant is at 1.2 times greater risk of developing the disease than they would have been otherwise. This is the additive model whereby each SNP acts independently of all other factors – it doesn’t matter what other genetic variants a person has, or indeed what environmental factors they may be exposed to – the added effect on risk of this SNP is the same in all carriers.
That is, I think, a pretty common interpretation of what the odds ratio means in individuals, but it is certainly not the only scenario that could produce that result at the population level. In the diagram below, I illustrate several different scenarios that could all yield the same odds ratio across the population.
The additive scenario is illustrated in A. Every person who inherits the risk allele has a slightly increased risk of disease (small red arrows). [This applies whether the SNP that is genotyped in the GWAS has a functional effect itself or tags another common SNP that is the one doing the damage].
It might seem like the odds ratio can be interpreted directly as a multiplier of the baseline risk across the population, i.e., the prevalence of the disease in question. So, if the baseline rate is say 1%, then people with the “A” allele in our example above would have a risk of 1.167%, all other things being equal. The problem with that interpretation is that all other things are not equal.
For example, a condition like autism affects about 1% of the population. This does not mean, however, that everyone in the population had a 1% risk of being born autistic, and that the ones who actually are autistic were just unlucky (statistically speaking, not judgmentally). That 1% is actually made up of people who were at very high risk of being autistic – we know this because people with the same genotype as those with autism (i.e., their monozygotic twins) have a rate of autism of over 80%. What this implies is that the vast majority of the population were at effectively no risk (not at 1% risk).
This suggests that the effects of any SNP are also likely to be highly unequally distributed across the population*, depending on the genetic background, as illustrated in Scenario B. In some people, the risk variant increases risk a little bit (small red arrows), while in others it increases it a lot (bigger red arrows). In others it may have no effect (flat blue line), while in yet others it may actually decrease risk (green downward arrow).
That last situation may seem far-fetched but is actually well described; for example, two mutations that each independently cause epilepsy may paradoxically cancel each other out if they occur together. Similarly, mutations in the fragile X gene, Fmr1, or in the tuberous sclerosis gene, Tsc2, can each cause autism in humans and various neurological and behavioural symptoms when mutated in mice. However, combining them both in mice leads to a rescue of the symptoms caused by either one alone (because they counteract each other at the biochemical level).
These kinds of “epistatic” (non-additive) interactions are generally very common and can be seen for all kinds of complex traits. In terms of how they would contribute to a GWAS signal, a slight preponderance of increased risk when you average those effects across the population would generate a small odds ratio greater than 1. Based on the odds ratio alone, there is no way to distinguish scenarios A and B.
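A toy calculation makes the point concrete. In this sketch all the numbers are invented, and for simplicity I assume the same 1% baseline risk in every non-carrier and that genetic background is independent of carrier status (assumptions that are themselves unrealistic). Each scenario is described as a list of (fraction of carriers, multiplier on risk) pairs.

```python
def carrier_odds_ratio(baseline_risk, effect_mix):
    """Odds ratio for carriers vs non-carriers of a variant.

    effect_mix lists (fraction_of_carriers, risk_multiplier) pairs describing
    how the variant changes risk in different genetic backgrounds.
    """
    if abs(sum(fraction for fraction, _ in effect_mix) - 1.0) > 1e-9:
        raise ValueError("carrier fractions must sum to 1")
    # Average risk across all carriers, whatever their background.
    carrier_risk = baseline_risk * sum(
        fraction * multiplier for fraction, multiplier in effect_mix
    )
    carrier_odds = carrier_risk / (1 - carrier_risk)
    baseline_odds = baseline_risk / (1 - baseline_risk)
    return carrier_odds / baseline_odds


# Scenario A: the same modest effect in every carrier.
scenario_a = [(1.0, 1.2)]
# Scenario B: a large effect in some carriers, none in most, protective in a few.
scenario_b = [(0.2, 2.2), (0.7, 1.0), (0.1, 0.6)]

or_a = carrier_odds_ratio(0.01, scenario_a)
or_b = carrier_odds_ratio(0.01, scenario_b)
print(f"Scenario A: OR = {or_a:.4f}")
print(f"Scenario B: OR = {or_b:.4f}")
```

Because the odds ratio only sees the average risk among carriers, the two mixes give the same value even though the effects in individuals are completely different.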
Note that this kind of effect holds for all epidemiological data – the effect sizes obtained are always averages across the population which may hide substantial variability in effect size across individuals. For example, a high-fat diet may be a much higher risk factor for cardiovascular disease in some people than in others, based on their genetic vulnerability.
It is interesting to note that if those kinds of diverse epistatic interactions occur for each SNP, then their aggregate effects will likely always look additive, as these pairwise and higher-order interactions will average out both within and across individuals. That doesn’t mean they could not in principle be decomposed to reveal such effects, as can be done using various genetic techniques in model organisms. So, just because SNP effects seem to combine additively does not rule out multiple epistatic interactions at the biological level.
Scenario C is a special case of epistatic interaction. In this case, the common risk variant has no effect on biological risk at all in most carriers (flat blue lines). However, if it occurs in people with a rare mutation in some specific gene (big purple arrow), which by itself predisposes to the disease with incomplete penetrance (where not everyone with the mutation necessarily develops the disease), then it can have a modifying effect, strongly increasing the likelihood of actual expression of the disease symptoms.
Again, this kind of scenario is well documented and is particularly well illustrated by Hirschsprung disease. This disorder, which affects innervation of the gut, can be caused by mutations in any one of about 18 known genes, one of which encodes the Ret tyrosine kinase. However, mutations in this gene are not completely penetrant – some people with it do not develop disease or have only a mild form. Recent studies have found that simultaneously carrying a common variant in the same gene increases the likelihood that carriers of the rare mutation will show severe disease. The common variant thus modifies the risk of disease substantially, but only in carriers of a rare mutation. (In this case it is in the same gene, but that doesn’t have to be the case).
The last scenario, D, is quite different. Here, the common variant is not doing anything itself. It’s not even linked to another common variant that is doing something. Instead, it is linked to a rare mutation that causes disease with much higher penetrance. Or, to put it better, the rare mutation is linked to it. Any new mutation must arise on a background of some set of common SNPs (a “haplotype”), with which it will tend to be subsequently co-inherited. If a rare mutation that increases risk of disease rises to an appreciable frequency then it will necessarily increase the frequency of the SNPs in that haplotype in people with the disease, giving rise to what has been called a “synthetic association”.
Any one mutation might be too rare to cause such an effect (especially if it is likely to be selected against precisely because it causes disease), but if you have multiple rare mutations at a given locus, and if they happen to occur by chance more on one haplotype than another, then you could get an aggregate effect that could give a tiny difference in frequency of the sort detected by GWAS.
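How a synthetic association produces a small odds ratio can be sketched with a deliberately crude model. Everything here is an assumption for illustration: individuals are treated as single chromosomes (a haploid toy model), every rare mutation is assumed to sit on a haplotype carrying the common “tag” allele, the tag allele has no effect of its own, and the frequencies, penetrance and background risk are invented.

```python
def odds_ratio(freq_cases, freq_controls):
    """Odds ratio from the frequency of a factor in cases and in controls."""
    return (freq_cases / (1 - freq_cases)) / (freq_controls / (1 - freq_controls))


def synthetic_association(tag_freq, mutation_freq, penetrance, background_risk):
    """Return (odds ratio of the common tag allele, odds ratio of the rare mutation)."""
    tag_without_mutation = tag_freq - mutation_freq
    # Probability of being a case, overall and jointly with carrying the tag.
    p_case = mutation_freq * penetrance + (1 - mutation_freq) * background_risk
    p_case_and_tag = (
        mutation_freq * penetrance + tag_without_mutation * background_risk
    )
    tag_freq_in_cases = p_case_and_tag / p_case
    tag_freq_in_controls = (tag_freq - p_case_and_tag) / (1 - p_case)
    # Odds of disease with the mutation versus without it.
    mutation_or = (penetrance / (1 - penetrance)) / (
        background_risk / (1 - background_risk)
    )
    return odds_ratio(tag_freq_in_cases, tag_freq_in_controls), mutation_or


# Hypothetical values: tag allele at 25%, rare mutations at 0.1% in aggregate,
# 20% penetrance, 1% risk in everyone else.
tag_or, mutation_or = synthetic_association(0.25, 0.001, 0.20, 0.01)
print(f"Common tag allele: OR = {tag_or:.3f}")
print(f"Rare mutation:     OR = {mutation_or:.1f}")
```

Under these assumptions the functionally inert common allele picks up a small odds ratio of the sort seen in GWAS, while the rare mutations it tags have a far larger one.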
There are now many documented examples where GWAS signals are explained by synthetic associations with rare mutations in the sample, which have much larger odds ratios (e.g., 1, 2, 3, 4). On the other hand, there are also cases where no such rare mutations have been found (e.g., 5, 6), suggesting that such a mechanism is by no means universal. It is difficult indeed to know how prevalent that situation will turn out to be, though large-scale whole-genome sequencing studies currently underway should help address this question. (See here for theoretical discussions: 7, 8, 9, 10).
Both scenarios C and D are congruent with the repeated finding that many of the genes implicated by GWAS (with small effect sizes) are known to sometimes carry rare mutations linked to a high risk of the same disease. That would fit with a mechanism whereby common variants at a given locus increase the penetrance of rare mutations in the same gene, but have little effect otherwise (scenario C). Or it would fit with GWAS signals actually arising from synthetic association with high-penetrance rare mutations in the population (where the common variant tags these haplotypes but has no effect itself whatsoever; scenario D).
Teasing these various scenarios apart is a challenge, especially as, for any given disease, different scenarios may pertain for different SNPs. One method has been to try to find a functional effect of a common SNP at the molecular level. For example, SNPs may affect the expression of a gene, altering binding of regulatory proteins to the parts of DNA that specify how much of the protein to make, in which cells and under which conditions. Multiple such examples have been documented (sometimes with surprising results, as when the gene thus affected is actually quite distant from the SNP itself).
However, finding some effect of a common SNP on expression of a gene at a molecular level does not explain how it affects disease risk. Any of scenarios A, B or C could still pertain, and even scenario D is not ruled out by such findings. Indeed, it is not even clear what kind of molecular-level effect we should expect to explain a tiny odds ratio. Should we expect a small effect at the molecular level, or a big effect at the molecular level that translates to a small effect at the organismal level? Or a big effect at the organismal level, but only in combination with other genetic or environmental insults?
That leaves something of a Catch-22 situation for researchers looking for functional effects of SNPs at the biological level – too small an effect and it will never be detected in messy biological experiments; too big and it will have a rather glaring discrepancy with the epidemiological odds ratio. In the end, it may prove impossible to definitively investigate such small individual epidemiological effects at the biological level, whether from genetic or environmental factors.
This doesn’t mean individual GWAS signals are not useful, of course – they certainly point to loci of interest for further study and have successfully implicated previously unknown biochemical pathways in various diseases (e.g., autophagy in Crohn’s disease). It does mean, however, that the interpretation of individual SNP associations may remain a bit vague.
On the other hand, while the biological effect of any single SNP in isolation may be small, their aggregate effect should be large, at least if the model of disease being caused by a polygenic load of such common risk alleles is correct. Indeed, even if the burden of common alleles is not by itself sufficient to cause disease (e.g., in a scenario where they act collectively as a polygenic modifier of rare mutations, which I consider the most likely scenario), they may still have biological effects in aggregate on relevant traits.
There is now an ever-growing number of studies taking that approach, correlating polygenic scores of risk for various diseases (based on aggregate SNP burden) with a range of biological phenotypes. Whether this approach will really help reveal underlying pathogenic mechanisms remains to be seen. More on that in a later post.
With thanks to John McGrath for helpful comments and edits.
*The usual way around this is to model the effects of a SNP on the liability scale, rather than the observed scale of risk. This is based on the idea that underlying the observed discontinuous distribution of a disease is a normally distributed burden of liability, which effectively remains latent until some threshold of burden is passed, in which case disease results. As a mathematical model to describe risk across the population this works reasonably well, given a host of assumptions. It is a mistake, however, in my mind, to think that the model reflects pathogenic mechanisms in individuals.
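For readers who want to see the liability-threshold idea in numbers, here is a minimal sketch. It assumes a standard normal liability (mean 0, variance 1) and a threshold set by a 1% prevalence; the function name and the sizes of the liability shifts are hypothetical choices of mine.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def risk_on_observed_scale(prevalence, liability_shift):
    """Disease risk for a group whose mean liability is shifted by liability_shift."""
    # Threshold chosen so that the unshifted population has the given prevalence.
    threshold = STANDARD_NORMAL.inv_cdf(1 - prevalence)
    # Probability that liability exceeds the threshold after the shift.
    return 1 - STANDARD_NORMAL.cdf(threshold - liability_shift)


# The same small shift (0.05) applied to an average background (0.0)
# and to a background already carrying a heavy burden (1.0).
for background in (0.0, 1.0):
    before = risk_on_observed_scale(0.01, background)
    after = risk_on_observed_scale(0.01, background + 0.05)
    print(f"background {background:.1f}: risk {before:.4f} -> {after:.4f}")
```

An effect that is constant on the liability scale thus translates into quite different changes in actual risk depending on the rest of a person's burden, which is what lets the model describe unequal effects across the population.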