What do GWAS signals mean?
Genome-wide association studies (GWAS) have been highly successful at linking genetic variation in hundreds of genes to an ever-growing number of traits or diseases. The fact that the genes implicated fit with the known biology for many of these traits or disorders strongly suggests that the findings from GWAS are “real” – they reflect some real biological involvement of those genes in those diseases. (For example, GWAS have implicated skeletal genes in height, immune genes in immune disorders, and neurodevelopmental genes in schizophrenia.)
But figuring out the nature of that
involvement and the underlying biological mechanisms is much more challenging.
In particular, it is not at all straightforward to understand how statistical
measures derived at the level of populations relate to effects in individuals.
Here, I explore some of the diverse mechanisms in individuals that may underlie
GWAS signals.
GWAS take an epidemiological approach to
identify genetic variants associated with risk of disease in exactly the same
way epidemiologists identify environmental factors associated with risk – they
look for factors that are more frequent in cases with a disease than in
unaffected controls. For example, smoking is more common in people with lung
cancer than in people without lung cancer (even though only a minority of
people who smoke get lung cancer). From this we can deduce that smoking may be
a risk-modifying factor for lung cancer, and we can measure the strength of
that effect. Of course, observational epidemiology cannot prove causation – but
it can provide important clues as to the risk architecture of a disease.
For GWAS, the factors in question are not
environmental – they are the differences in our DNA that exist at millions of
positions across the genome. These “single-nucleotide polymorphisms”, or SNPs,
are positions in the genome where the DNA sequence varies between people –
sometimes it might be an “A”, sometimes it might be a “T” (or a “G” or a “C”). Of
course, any position in the genome can be mutated and likely is mutated in
someone on the planet, but such mutations are typically extremely rare. SNPs
are different – they are positions where two different versions are both
relatively frequent in the population; these versions are thus often referred
to as common variants.
GWAS are premised on the simple idea that if
any of those common variants at any of those millions of SNPs across the genome
is associated with an increased risk of disease, then that variant should be
more frequent in cases than in controls. So, if we find variants that are more
common in cases than in controls, we can infer that these variants may be
causally related to an increased risk of disease.
What that doesn’t tell us is how. How does
having one variant over another at that particular site cause an increased risk
of that particular disease? I don’t just mean by what biological mechanism; I mean
how does risk calculated at the population level relate to effects in
individuals?
Statistically, we get two measures out of
GWAS for any SNP that is associated. One is the p-value, which is a measure of
how unlikely it would be to see a frequency difference of the magnitude we
observe, just by chance. You might, for example, find that the “A” version at
one SNP is at 25% frequency in controls but 28% frequency in cases. That’s not
a big difference, so you’d need a very big sample to make sure it wasn’t noise,
which is precisely why GWAS now use sample sizes of tens or even hundreds of
thousands of people.
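To make the sample-size point concrete, here is a minimal sketch of a pooled two-proportion z-test on the 28% versus 25% example. The test choice, the function name and the two sample sizes are illustrative assumptions, not the procedure any particular GWAS uses (real analyses typically use regression models with covariates).

```python
import math


def allele_frequency_p_value(freq_cases, freq_controls, n_case_alleles, n_control_alleles):
    """Two-sided p-value from a pooled two-proportion z-test (normal approximation)."""
    pooled = (freq_cases * n_case_alleles + freq_controls * n_control_alleles) / (
        n_case_alleles + n_control_alleles
    )
    standard_error = math.sqrt(
        pooled * (1 - pooled) * (1 / n_case_alleles + 1 / n_control_alleles)
    )
    z = (freq_cases - freq_controls) / standard_error
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) and stays accurate for tiny tails
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical sample sizes, counted in alleles (two per person)
p_small = allele_frequency_p_value(0.28, 0.25, 1_000, 1_000)
p_large = allele_frequency_p_value(0.28, 0.25, 100_000, 100_000)

print(f"1,000 alleles per group:   p = {p_small:.3g}")
print(f"100,000 alleles per group: p = {p_large:.3g}")
```

With the smaller sample the same three-point difference is indistinguishable from noise; with the larger one it is not.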
GWAS also apply very rigorous thresholds
for statistical significance, in order to correct for the fact that they are
testing so many different SNPs. (This follows the logic that, while it is quite
unlikely that you will win the lottery yourself, if enough tickets are sold, it
won’t be surprising if the lottery is won by somebody). These methods have
greatly advanced the trustworthiness of results from the field, far beyond
those reported in the benighted “candidate gene era”. But the p-value doesn’t
tell us anything about how big of an effect there is – how much of an effect on
risk does the difference in frequency between cases and controls reflect?
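The lottery logic can be put in numbers. The sketch below assumes roughly one million independent SNP tests, a commonly used round figure rather than something stated above, and applies a simple Bonferroni correction, which is one standard way to arrive at the conventional genome-wide threshold.

```python
import math

n_tests = 1_000_000  # assumed number of independent SNP tests (illustrative)
alpha = 0.05         # conventional threshold for a single test

# If no SNP had any real effect, this many would still pass p < 0.05 by chance
expected_false_positives = n_tests * alpha

# Bonferroni correction: divide the single-test threshold by the number of tests
bonferroni_threshold = alpha / n_tests

print(f"Expected chance hits at p < {alpha}: {expected_false_positives:,.0f}")
print(f"Corrected per-SNP threshold: {bonferroni_threshold:.0e}")
```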
That number is summarised by the other
measure we get for each associated SNP, which is the odds ratio. This reflects
the size of the difference in
frequency of that variant between cases and controls. It is calculated very
simply: say your SNP comes in two versions, or “alleles”: “A” and “G”. We want
to convert the difference in absolute frequencies in cases versus controls (say
28% vs 25%, or 62% vs 60%, or whatever it is) into a number that tells us how many times more common is one
version in cases versus controls. (The reason is that that number is more
easily related to the increased risk associated with having that version).
Here’s an example: If we take 28% and 25%
as frequencies of the “A” allele at a certain SNP in cases and controls,
respectively, then if you were to select an “A” allele at random from the
sample, the odds of it coming from a case versus a control is 0.28/0.25
(=1.12). The odds of the alternative “G” allele occurring in a case versus a
control is correspondingly lower: 0.72/0.75 (=0.96). The odds ratio is then
1.12/0.96 = 1.167. Assuming that the
cases and controls are representative of the general population, we can infer
that individuals with an “A” allele are 1.167 times more likely to be a case,
compared to those with the “G” allele, which is the number we’re after. (Note
that this approximation of odds ratio to relative risk only holds when the
disease is rare).
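The arithmetic above can be written as a short sketch. The function names are my own, and the conversion to relative risk uses the standard formula RR = OR / (1 - p0 + p0 * OR), where p0 is the risk in the comparison group; the two baseline risks are hypothetical values chosen to show when the approximation holds.

```python
def allelic_odds_ratio(freq_cases, freq_controls):
    """Odds ratio for an allele, given its frequency in cases and in controls."""
    odds_cases = freq_cases / (1 - freq_cases)
    odds_controls = freq_controls / (1 - freq_controls)
    return odds_cases / odds_controls


def relative_risk(odds_ratio, baseline_risk):
    """Convert an odds ratio to a relative risk for a given baseline risk."""
    return odds_ratio / (1 - baseline_risk + baseline_risk * odds_ratio)


or_a = allelic_odds_ratio(0.28, 0.25)
# Same number as the step-by-step version in the text: (0.28/0.25) / (0.72/0.75)
or_a_stepwise = (0.28 / 0.25) / (0.72 / 0.75)

rr_rare = relative_risk(or_a, baseline_risk=0.01)    # rare disease (hypothetical)
rr_common = relative_risk(or_a, baseline_risk=0.30)  # common disease (hypothetical)

print(f"Odds ratio: {or_a:.3f}")
print(f"Relative risk at 1% baseline:  {rr_rare:.3f}")
print(f"Relative risk at 30% baseline: {rr_common:.3f}")
```

At a 1% baseline the relative risk is almost identical to the odds ratio; at a 30% baseline it is noticeably smaller.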
If you do the same calculations for 62% vs
60% it works out to 1.09. These odds ratios are on the order of the typical
values obtained from GWAS. For comparison, the odds ratio for smoking and lung
cancer is around 20. It is calculated in the same way, e.g., from data like
these from a study in Spain in the 1980s (where smoking was apparently
astronomically common!): this study found that 98.8% of lung cancer patients
were smokers, while “only” 80.3% of controls were smokers. Doing the same
calculations as above gives an OR = 20.2, which is consistent with many other
studies.
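As a check, the same calculation can be run on the other two pairs of percentages quoted here. This is only a sketch of the arithmetic; the function name is an assumption.

```python
def odds_ratio(freq_cases, freq_controls):
    """Odds of the factor in cases divided by odds of the factor in controls."""
    return (freq_cases / (1 - freq_cases)) / (freq_controls / (1 - freq_controls))


or_snp = odds_ratio(0.62, 0.60)        # allele at 62% in cases vs 60% in controls
or_smoking = odds_ratio(0.988, 0.803)  # smokers among lung cancer cases vs controls

print(f"SNP example:     OR = {or_snp:.2f}")      # about 1.09
print(f"Smoking example: OR = {or_smoking:.1f}")  # about 20.2
```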
Thus, for either genetic or environmental
factors, the odds ratio gives an average increased risk of disease. But, biologically,
what is actually going on in each individual that collectively gives that signal?
The most straightforward interpretation is
that an odds ratio of, say, 1.2 at the population level reflects exactly the
same thing at the individual level – each individual who inherits that SNP
variant is at 1.2 times greater risk of developing the disease than they would
have been otherwise. This is the additive model whereby each SNP acts independently
of all other factors – it doesn’t matter what other genetic variants a person
has, or indeed what environmental factors they may be exposed to – the added effect
on risk of this SNP is the same in all carriers.
That is, I think, a pretty common
interpretation of what the odds ratio means in individuals, but it is certainly
not the only scenario that could produce that result at the population level. In
the diagram below, I illustrate several different scenarios that could all
yield the same odds ratio across the population.
The additive scenario is illustrated in A.
Every person who inherits the risk allele has a slightly increased risk of
disease (small red arrows). [This applies whether the SNP that is genotyped in
the GWAS has a functional effect itself or tags another common SNP that is the
one doing the damage].
It might seem like the odds ratio can be
interpreted directly as a multiplier of the baseline risk across the
population, i.e., the prevalence of the disease in question. So, if the
baseline rate is say 1%, then people with the “A” allele in our example above
would have a risk of 1.167%, all other things being equal. The problem with
that interpretation is that all other things are not equal.
For example, a condition like autism
affects about 1% of the population. This does not mean, however, that everyone
in the population had a 1% risk of being born autistic, and that the ones who
actually are autistic were just unlucky (statistically speaking, not judgmentally).
That 1% is actually made up of people who were at very high risk of being
autistic – we know this because people with the same genotype as those with autism (i.e., their monozygotic twins) have a rate of autism of over 80%. What
this implies is that the vast majority of the population were at effectively no
risk (not at 1% risk).
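The twin argument can be illustrated with a deliberately crude sketch. It assumes that identical twins share exactly the same risk and that, given that risk, each twin's outcome is independent; under those assumptions the chance that the co-twin of an affected person is also affected is E[risk squared] / E[risk]. The two risk distributions are hypothetical.

```python
def prevalence(risk_classes):
    """Population prevalence from (fraction of population, individual risk) pairs."""
    return sum(fraction * risk for fraction, risk in risk_classes)


def identical_twin_concordance(risk_classes):
    """P(co-twin affected | twin affected) = E[risk^2] / E[risk] under the stated assumptions."""
    mean_squared_risk = sum(fraction * risk * risk for fraction, risk in risk_classes)
    return mean_squared_risk / prevalence(risk_classes)


# Everyone at the same 1% risk
uniform = [(1.0, 0.01)]
# 1.25% of people at 80% risk, everyone else at zero risk
concentrated = [(0.0125, 0.80), (0.9875, 0.0)]

for label, classes in [("uniform", uniform), ("concentrated", concentrated)]:
    print(
        f"{label}: prevalence = {prevalence(classes):.3f}, "
        f"twin concordance = {identical_twin_concordance(classes):.2f}"
    )
```

Both populations have a 1% prevalence, but only the concentrated one yields a twin concordance anywhere near the observed value.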
This suggests that the effects of any SNP are
also likely to be highly unequally distributed across the population*, depending
on the genetic background, as illustrated in Scenario B. In some
people, the risk variant increases risk a little bit (small red arrows), while
in others it increases it a lot (bigger red arrows). In others it may have no
effect (flat blue line), while in yet others it may actually decrease risk
(green downward arrow).
That last situation may seem far-fetched but
is actually well described; for example, two mutations that each independently
cause epilepsy may paradoxically cancel each other out if they occur together.
Similarly, mutations in the fragile X gene, Fmr1, or in the tuberous sclerosis
gene, Tsc2, can each cause autism in humans and various neurological and
behavioural symptoms when mutated in mice. However, combining them both in mice
leads to a rescue of the symptoms caused by either one alone (because they
counteract each other at the biochemical level).
These kinds of “epistatic” (non-additive)
interactions are generally very common and can be seen for all kinds of complex
traits. In terms of how they would contribute to a GWAS signal, a slight
preponderance of increased risk when you average those effects across the
population would generate a small odds ratio greater than 1. Based on the odds
ratio alone, there is no way to distinguish scenarios A and B.
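A small numerical sketch shows why the odds ratio cannot separate the two scenarios. All numbers are hypothetical: non-carriers are given a flat 1% risk, and in scenario B the carriers are split into three groups whose sizes and effects were chosen so that the average carrier risk matches scenario A. It also assumes that genetic background is independent of carrier status.

```python
def odds(probability):
    return probability / (1 - probability)


def risk_from_odds(odds_value):
    return odds_value / (1 + odds_value)


def population_odds_ratio(carrier_groups, noncarrier_risk):
    """Odds ratio seen at the population level, averaging over all carriers."""
    mean_carrier_risk = sum(fraction * risk for fraction, risk in carrier_groups)
    return odds(mean_carrier_risk) / odds(noncarrier_risk)


baseline_risk = 0.01  # hypothetical risk in non-carriers
target_or = 1.2

# Scenario A: every carrier has the same, slightly raised risk
carrier_risk_a = risk_from_odds(target_or * odds(baseline_risk))
scenario_a = [(1.0, carrier_risk_a)]

# Scenario B: 80% of carriers unaffected, 5% protected, 15% at much higher risk
protected_risk = baseline_risk / 2
high_risk = (carrier_risk_a - 0.80 * baseline_risk - 0.05 * protected_risk) / 0.15
scenario_b = [(0.80, baseline_risk), (0.05, protected_risk), (0.15, high_risk)]

or_a = population_odds_ratio(scenario_a, baseline_risk)
or_b = population_odds_ratio(scenario_b, baseline_risk)
print(f"Scenario A: OR = {or_a:.3f}")
print(f"Scenario B: OR = {or_b:.3f} (risk in the susceptible 15%: {high_risk:.2%})")
```

The population-level odds ratio is the same in both cases, even though in scenario B the effect is concentrated in a minority of carriers.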
Note that this kind of effect holds for all
epidemiological data – the effect sizes obtained are always averages across the
population which may hide substantial variability in effect size across
individuals. For example, a high-fat diet may be a much higher risk factor for
cardiovascular disease in some people than in others, based on their genetic
vulnerability.
It is interesting to note that if those
kinds of diverse epistatic interactions occur for each SNP, then their
aggregate effects will likely always look additive, as these pairwise and
higher-order interactions will average out both within and across individuals.
That doesn’t mean they could not in principle be decomposed to reveal such effects, as can be done using various genetic techniques in model organisms.
So, just because SNP effects seem to combine additively does not rule out
multiple epistatic interactions at the biological level.
Scenario C is a special case of epistatic
interaction. In this case, the common risk variant has no effect on biological risk
at all in most carriers (flat blue lines). However, if it occurs in people
with a rare mutation in some specific gene (big purple arrow), which by itself
predisposes to the disease with incomplete penetrance (where not everyone with
the mutation necessarily develops the disease), then it can have a modifying
effect, strongly increasing the likelihood of actual expression of the disease
symptoms.
Again, this kind of scenario is well
documented and is particularly well illustrated by Hirschsprung disease. This
disorder, which affects innervation of the gut, can be caused by mutations in
any one of about 18 known genes, one of which encodes the Ret tyrosine kinase.
However, mutations in this gene are not completely penetrant – some people with
it do not develop disease or have only a mild form. Recent studies have found
that simultaneously carrying a common variant in the same gene increases the
likelihood that carriers of the rare mutation will show severe disease. The
common variant thus modifies the risk of disease substantially, but only in
carriers of a rare mutation. (In this case it is in the same gene, but that
doesn’t have to be the case).
The last scenario, D, is quite different.
Here, the common variant is not doing anything itself. It’s not even linked to
another common variant that is doing something. Instead, it is linked to a rare
mutation that causes disease with much higher penetrance. Or, to put it better,
the rare mutation is linked to it. Any new mutation must arise on a background
of some set of common SNPs (a “haplotype”), with which it will tend to be subsequently
co-inherited. If a rare mutation that increases risk of disease rises to an
appreciable frequency then it will necessarily increase the frequency of the
SNPs in that haplotype in people with the disease, giving rise to what has been
called a “synthetic association”.
Any one mutation might be too rare to cause
such an effect (especially if it is likely to be selected against precisely
because it causes disease), but if you have multiple rare mutations at a given
locus, and if they happen to occur by chance more on one haplotype than
another, then you could get an aggregate effect that could give a tiny
difference in frequency of the sort detected by GWAS.
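Here is a rough sketch of how a synthetic association could produce a GWAS-sized signal. Every number is hypothetical, and the rare mutations at the locus are lumped together and assumed to sit exclusively on haplotypes carrying the common “A” allele, which is a stronger assumption than the chance imbalance described above.

```python
def odds_ratio(freq_cases, freq_controls):
    return (freq_cases / (1 - freq_cases)) / (freq_controls / (1 - freq_controls))


background_freq_a = 0.25    # "A" frequency on chromosomes without a rare mutation
rare_freq_cases = 0.04      # fraction of case chromosomes carrying a rare mutation
rare_freq_controls = 0.001  # fraction of control chromosomes carrying one


def observed_freq_a(rare_freq):
    """Every rare-mutation chromosome carries "A"; the rest follow the background."""
    return rare_freq + (1 - rare_freq) * background_freq_a


freq_a_cases = observed_freq_a(rare_freq_cases)
freq_a_controls = observed_freq_a(rare_freq_controls)

or_common = odds_ratio(freq_a_cases, freq_a_controls)
or_rare = odds_ratio(rare_freq_cases, rare_freq_controls)

print(f'"A" frequency: {freq_a_cases:.3f} in cases, {freq_a_controls:.3f} in controls')
print(f"Odds ratio for the common SNP:     {or_common:.2f}")
print(f"Odds ratio for the rare mutations: {or_rare:.1f}")
```

The common allele does nothing in this sketch, yet it shows a small odds ratio simply because it travels with the rare mutations, which themselves have a far larger one.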
There are now many documented examples
where GWAS signals are explained by synthetic associations with rare mutations in
the sample, which have much larger odds ratios (e.g., 1, 2, 3, 4). On the other hand, there are also
cases where no such rare mutations have been found (e.g., 5, 6), suggesting that such a
mechanism is by no means universal. It is difficult indeed to know how
prevalent that situation will turn out to be, though large-scale whole-genome
sequencing studies currently underway should help address this question. (See here for theoretical discussions: 7, 8, 9, 10).
Both scenarios C and D are congruent with
the repeated finding that many of the genes implicated by GWAS (with small
effect sizes) are known to sometimes carry rare mutations linked to a high risk
of the same disease. That would fit with a mechanism whereby common variants at
a given locus increase the penetrance of rare mutations in the same gene, but
have little effect otherwise (scenario C). Or it would fit with GWAS signals actually
arising from synthetic association with high-penetrance rare mutations in the
population (where the common variant tags these haplotypes but has no effect
itself whatsoever; scenario D).
Teasing these various scenarios apart is a
challenge, especially as, for any given disease, different scenarios may
pertain for different SNPs. One method has been to try and find a functional
effect of a common SNP at the molecular level. For example, SNPs may affect the
expression of a gene, altering binding of regulatory proteins to the parts of
DNA that specify how much of the protein to make, in which cells and under
which conditions. Multiple such examples have been documented (sometimes with
surprising results, as when the gene thus affected is actually quite distant to
the SNP itself).
However, finding some effect of a common
SNP on expression of a gene at a molecular level does not explain how it affects
disease risk. Any of scenarios A, B or C could still pertain, and even scenario
D is not ruled out by such findings. Indeed, it is not even clear what kind of
molecular-level effect we should expect to explain a tiny odds ratio. Should we
expect a small effect at the molecular level, or a big effect at the molecular
level that translates to a small effect at the organismal level? Or a big
effect at the organismal level, but only in combination with other genetic or
environmental insults?
That leaves something of a Catch-22
situation for researchers looking for functional effects of SNPs at the biological
level – too small an effect and it will never be detected in messy biological
experiments; too big and it will have a rather glaring discrepancy with the epidemiological
odds ratio. In the end, it may prove impossible to definitively investigate
such small individual epidemiological effects at the biological level, whether
from genetic or environmental factors.
This doesn’t mean individual GWAS signals
are not useful, of course – they certainly point to loci of interest for
further study and have successfully implicated previously unknown biochemical
pathways in various diseases (e.g., autophagy in Crohn’s disease). It does
mean, however, that the interpretation of individual SNP associations may
remain a bit vague.
On the other hand, while the biological effect
of any single SNP in isolation may be small, their aggregate effect should be large, at least if the model of disease
being caused by a polygenic load of such common risk alleles is correct. Indeed,
even if the burden of common alleles is not by itself sufficient to cause
disease (e.g., in a scenario where they act collectively as a polygenic
modifier of rare mutations, which I consider the most likely scenario), they
may still have biological effects in aggregate on relevant traits.
There is now an ever-growing number of
studies taking that approach, correlating polygenic scores of risk for various
diseases (based on aggregate SNP burden) with a range of biological phenotypes.
Whether this approach will really help reveal underlying pathogenic mechanisms
remains to be seen. More on that in a later post.
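For readers unfamiliar with polygenic scores, a minimal sketch of one common construction is given below: each risk allele a person carries is weighted by the natural log of its odds ratio and the weights are summed. The SNP names, odds ratios and genotype are hypothetical, and the studies referred to above may build their scores differently.

```python
import math

# Hypothetical per-allele odds ratios from a GWAS
odds_ratios = {"snp_1": 1.167, "snp_2": 1.09, "snp_3": 1.05}

# Hypothetical individual: number of risk alleles (0, 1 or 2) at each SNP
genotype = {"snp_1": 2, "snp_2": 1, "snp_3": 0}


def polygenic_score(genotype, odds_ratios):
    """Sum of risk-allele counts weighted by log odds ratio (assumes additive effects)."""
    return sum(count * math.log(odds_ratios[snp]) for snp, count in genotype.items())


score = polygenic_score(genotype, odds_ratios)
print(f"Polygenic score (log-odds scale): {score:.3f}")
print(f"Implied combined odds ratio:      {math.exp(score):.3f}")
```

Note that summing log odds ratios builds in exactly the additive assumption questioned earlier in the post.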
With thanks to John McGrath for helpful
comments and edits.
*The usual way around this is to model the
effects of a SNP on the liability scale, rather than the observed scale of risk.
This is based on the idea that underlying the observed discontinuous
distribution of a disease is a normally distributed burden of liability, which
effectively remains latent until some threshold of burden is passed, in which
case disease results. As a mathematical model to describe risk across the
population this works reasonably well, given a host of assumptions. It is a
mistake, however, in my mind, to think that the model reflects pathogenic
mechanisms in individuals.
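A minimal sketch of the liability-threshold idea, assuming a standard normal liability, a 1% prevalence, and hypothetical shifts in mean liability (in standard deviation units) for carriers of a risk variant:

```python
from statistics import NormalDist

standard_normal = NormalDist()  # liability modelled as mean 0, SD 1
prevalence = 0.01               # assumed disease prevalence

# Liability value above which disease is said to result
threshold = standard_normal.inv_cdf(1 - prevalence)


def risk_given_shift(shift_sd):
    """Risk for a group whose mean liability is shifted upward by shift_sd."""
    return 1 - standard_normal.cdf(threshold - shift_sd)


print(f"Threshold: {threshold:.3f} SD above the mean")
for shift in (0.0, 0.06, 2.0):  # hypothetical shifts
    print(f"Shift of {shift:.2f} SD -> risk = {risk_given_shift(shift):.4f}")
```

The same shift in liability translates into very different changes in risk depending on how close a group already is to the threshold, which is how the model accommodates unequal effects on the observed scale.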
Brilliant! A must read for those working with model organisms and for human geneticists to learn how to communicate better with society,
Cheers
Vijay