How much innate knowledge can the genome encode?
In a recent debate between Gary Marcus and Yoshua Bengio about the future of Artificial Intelligence, the question came up of how much information the genome can encode. This relates to the idea of how much innate or prior “knowledge” human beings are really born with, versus what we learn through experience. This is a hot topic in AI these days as people debate how much prior knowledge needs to be pre-wired into AI systems, in order to get them to achieve something more akin to natural intelligence.
Bengio (like Yann LeCun) argues for putting as little prior knowledge into the system as we can get away with – mainly in the form of meta-learning rules, rather than specific details about specific things in the environment – such that the system that emerges through deep learning from the data supplied to it will be maximally capable of generalisation. (In his view, more detailed priors make for a more specialised, but also more limited and possibly more biased, machine.) Marcus argues for more prior knowledge as a scaffold on which to efficiently build new learning. He points out that this is exactly how evolution has worked – packing that kind of knowledge into the genomes of different species.
Of course, the question is how that works.
How does information in the genome lead to pre-wiring of the nervous system in
a way that automatically connects external stimuli to internal values or
associations?
In the course of a brief discussion on this
point, Bengio argued that there is not, in fact, enough information in the
genome to encode much prior knowledge: “…there’s
lots of room in the genome but clearly not enough to encode the details of what
your brain is doing. So, it has to be that learning is explaining the vast
majority of the actual computation done in the brain, just by counting
arguments: twenty thousand-odd genes, with a hundred billion neurons and a
thousand times more connections”.
Marcus responded with a reference to his excellent book “The Birth of the Mind”, in which he refutes this “genome shortage argument” – the claim, as he characterises it, that there simply isn’t enough information in the genome to specify a lot of priors and that most of our knowledge must therefore be learned.
Unfortunately, their discussion veered off
this topic pretty quickly, but it’s a very interesting question and, in my
view, the way that Bengio phrased it – as a simple numerical argument –
completely misconceives the way in which the kind of information we are after
is encoded in the genome.
He is not alone. This discussion harks back
to an issue that exercised many people in the biological community when the
Human Genome Project completed the first draft of the human genome and it turned
out that we had far fewer genes than had previously been thought. Previous
estimates had put humans at about 100,000 genes (each one coding for a
different protein), based on extrapolation from a variety of kinds of data, the
details of which are not really important. When it turned out we have only
about 20,000 genes, there were gasps of horror, as this number was not even twice that of lowly fruit flies or even of vastly simpler nematodes, which each have ~15,000 genes.
Nematodes only have ~1,000 cells, ffs! If
we are so complex (and we are, if nothing else, stubbornly impressed with
ourselves), how can we have roughly the same number of genes as a crappy little worm that doesn’t even have a brain? And, as Bengio and others argue, if we have only this limited number of genes, how can we encode any kind of sophisticated priors in the connectivity of our brains, which is, mathematically speaking, vastly more complicated?
Counting the wrong things
To me, this kind of thinking misconceives
entirely the way that developmental information is encoded in the genome. It’s
not the number of genes that matters, it’s the way that you use them. From a
developmental perspective, the number of proteins encoded is not the measure of
complexity. The human genome encodes almost exactly the same set of proteins as
every other mammal, with few meaningful differences in the actual amino acid
sequences. In fact, the biochemical sequences and functions of many proteins are conserved over much larger evolutionary distances – from insects to mammals, for example.
As one famous experiment illustrates (and many others have confirmed), it is often possible to substitute the fly version of a protein with the mouse version and have it function perfectly normally. The experiment I’m thinking of looked at a gene called eyeless in flies (so named because mutant flies lacking the gene have no eyes) and its counterpart, Pax6, in mice. Eyeless/Pax6 is known as a “master
regulator” gene – it encodes a transcription factor that regulates a cascade of
other genes necessary for eye development, in both flies and mice.
If you drive expression of the Eyeless
protein in parts of the developing fly that give rise to other appendages like
antennae, legs, or wings, you can convert those appendages into eyes as well.
Amazingly, if you express the mouse version, Pax6, in these tissues, the same
thing happens – you form extra ectopic eyes (fly eyes, not mouse eyes). So, the
differences in amino acid sequence between the proteins encoded by eyeless and Pax6 don’t make much of a difference to their function. What does
make a difference is the context in which it is expressed – the network of
genes that the protein regulates in each species and the subsequent cascading
developmental trajectory.
[Figure: Expression of eyeless in other tissues drives formation of ectopic eyes]
This relies not on the protein-coding
sequences, but on the so-called regulatory sequences of the target genes. These
are short stretches of DNA, often adjacent to the protein-coding bits, that act as binding sites for other proteins (transcription factors or chromatin regulators) and thereby control when and where the gene is expressed – in which cell types and at what levels.
[Figure from Innate]
These regulatory elements typically work in
a modular fashion and are much more evolvable than the sequences of the
regulatory proteins. The latter are highly functionally constrained because
each such protein regulates many targets and any change in its sequence will
have many, diverse effects. (This is probably why they work the same even
across distantly related species). By contrast, changing the binding site for
one protein that regulates gene A will change only that aspect of the
expression pattern of gene A which that protein regulates, leaving all other
aspects and all other genes unchanged.
The complexity of those regulatory sequences
is not at all captured by the number of genes. Indeed, it is not even captured
by the number of such elements themselves, as they interact in combinatorial
and highly context-dependent ways, in cooperation or competition with different
sets of factors in different cell types and at different stages of development.
(For example, Pax6 in mammals is also involved in patterning the developing neocortex and in specifying a number of different cell types in the developing
spinal cord, in combination with other transcription factors). The complexity
of the output thus should not be expected to scale linearly with the number of
genetic elements.
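To make that combinatorial point concrete, here is a toy calculation – a sketch only, using round numbers (roughly 1,500 human transcription factors, and a hypothetical half-dozen factors read out per regulatory element): the number of distinct factor combinations a single element could in principle respond to dwarfs the gene count itself.

```python
# Toy illustration of combinatorial regulation (round numbers, not real biology):
# a modest parts list yields an astronomical space of regulatory "codes".
from math import comb

n_tfs = 1500           # approximate number of human transcription factors
tfs_per_element = 6    # hypothetical: each regulatory element reads a combination of ~6 factors

combinations = comb(n_tfs, tfs_per_element)
print(f"{combinations:.2e} possible 6-factor combinations")   # ~1.6e16
print(f"ratio to ~20,000 genes: {combinations / 2e4:.1e}")    # ~8e11
```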
Being amazed that you can make a human with
only 20,000 genes is thus like being amazed that Shakespeare could write all
those plays with only 26 letters. It’s totally missing where the actual,
meaningful information is and how it is decoded.
How do we measure information?
Part of the problem is in thinking of
Shannon information as the only measure of information content that is really
scientific (nicely described in James Gleick’s “The Information”). Shannon information
(named after Claude Shannon) simply relates to the efficiency of encoding
information for signal transmission. It is measured in bits, which correspond
to how many “yes or no” questions you would have to ask to derive the original
message that you want to transmit.
Importantly, this does not take into
account at all what the message means. Indeed, a purely random sequence of
letters has greater Shannon information than a string of words that make up a
sentence, because the words and the sentence have some higher-order patterns in
them (like statistics of letters that typically follow each other, such as a
“u” following a “q”), which can be used to compress the message. A random
sequence has no such patterns and thus cannot be compressed.
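For anyone who wants to see that concretely, here is a small sketch. It estimates only zeroth-order entropy, from single-letter frequencies; real English has higher-order structure (letter pairs, words, grammar) that lowers the figure further still.

```python
# Zeroth-order Shannon entropy of English text vs a random string of letters.
import math, random, string
from collections import Counter

def entropy_bits_per_char(text: str) -> float:
    """Shannon entropy estimated from single-character frequencies."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

english = ("it is not the number of genes that matters it is the way that "
           "you use them the human genome encodes almost exactly the same "
           "set of proteins as every other mammal")

random.seed(0)
rand = "".join(random.choice(string.ascii_lowercase + " ") for _ in range(len(english)))

print(f"English text: {entropy_bits_per_char(english):.2f} bits per character")
print(f"Random text:  {entropy_bits_per_char(rand):.2f} bits per character")
# The random string comes out near the maximum of log2(27), about 4.75 bits per
# character; the English text is lower, and a smarter encoder exploiting its
# higher-order patterns could compress it much further.
```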
Thinking in those terms naturally leads to
the kind of “counting arguments” that Bengio makes. These seem to take each
gene as a bit of information, and ask whether there are enough such bits to
specify all the bits in the brain, usually taken as the number of connections. Obviously
the answer is that there are not enough such bits. (There aren’t even enough bits if
you take individual bases of the genome as your units of information).
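For concreteness, here is that counting argument done as a back-of-the-envelope calculation, using rough orders of magnitude and the figures from Bengio’s quote above:

```python
# Back-of-the-envelope version of the counting argument (rough orders of
# magnitude only), treating the genome as raw bits and asking whether it
# could list every connection in the brain explicitly.
import math

genome_bases = 3e9                          # ~3 billion base pairs
genome_bits = genome_bases * 2              # 4 possible bases = 2 bits each, ~6e9 bits

neurons = 1e11                              # "a hundred billion neurons"
synapses = neurons * 1e3                    # "a thousand times more connections"
bits_per_synapse = math.log2(neurons)       # ~37 bits just to name one target neuron

bits_needed = synapses * bits_per_synapse   # ~3.7e15 bits

print(f"genome capacity     : {genome_bits:.1e} bits")
print(f"explicit wiring list: {bits_needed:.1e} bits")
print(f"shortfall           : ~{bits_needed / genome_bits:.0e}x")
# Nowhere near enough bits - if, that is, you treat the genome as an explicit
# wiring list.
```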
But the genome is not a blueprint. Bits of
the genome do not correspond to bits of the body (or bits of the brain). The
genome is much more like a program or an algorithm. And computer scientists
have a much better measure to get at how complex that program is, which is
known as algorithmic complexity (or Kolmogorov complexity). This is the length
of the shortest computer program that can produce the output in question.
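A trivial sketch makes the distinction clear (any short generative rule would do; the point is only that output size and program size are entirely different quantities):

```python
# Kolmogorov complexity is uncomputable in general, but the basic point is easy
# to illustrate: a very short program (low algorithmic complexity) can generate
# an enormous output, so the size of the output tells you little about the size
# of the program that produced it.
def tiny_program(n: int) -> str:
    """A one-line rule that emits arbitrarily large output."""
    return "ab" * n

output = tiny_program(10_000_000)
print(len(output))   # 20,000,000 characters of output from a one-line program
```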
How complex is your code?
What we really would like to know is the
Kolmogorov complexity of the human brain – how complex does the developmental
program have to be to produce it? (More specifically, to get it to
self-assemble, given the right starting conditions in a fertilised egg). And,
most germane for this discussion, how complex would that program have to be to
specify lots of innate priors?
Speaking as a developmental neurobiologist,
if we knew the answers to those questions, we’d be done. We can’t figure out or
quantify how complex the developmental program is until we know what the
developmental program is (*but see note from Tony Zador, below). In fairness, we know a lot – the field is in some ways quite mature, at least in terms of major
principles by which the brain self-assembles based on genomic instructions. But
I think it’s also fair to say that we do not have a complete understanding of
the molecular logic underlying the guidance of the projections and the
formation of synaptic connections of even a single neuron (like this one, for
example) in any species. What we can say is that that logic is highly
combinatorial, meaning you can get a lot of complexity out of a limited set of
molecules.
And we are only beginning to understand how
instinct and innate preferences and behaviours are wired into the circuitry of the brain. There are growing numbers of examples of subtle genetic differences
that lead to specific differences in neural circuitry that explain differences
in behaviour. Some of these relate to differences in behaviour between closely
related species (such as monogamy vs polygamy, or geometry of tunnel building).
But probably the best-studied are sex differences within species, where a
subtle genetic difference between males and females (the activity of the Sxl
gene in flies or the presence of the SRY gene in mammals, for example) shifts
the developmental trajectory of certain circuits, and pre-wires preferences and
behaviours.
It’s an exciting time for this kind of
research, with lots of amazing new tools being focused on fascinating
biological questions in all kinds of species, but really this work is only just
beginning. Moreover, it typically focuses on how a difference in the genome leads to a difference in circuitry and innate behaviour. It doesn’t explain
the full program required to specify all of the circuitry on which any such
behaviour is built.
So, we simply can’t currently answer the
question of how complex the developmental program is and how much innate
knowledge or behaviour it could really encode because we just don’t know enough
about how the program works or how innate knowledge and behaviour is wired into
the brain. But I think we know enough to know that the number of genes is not
the number we’re after.
In fact, I’m not sure there is a number
that could capture what we’re after. I understand the urge to quantify the information
in the genome in some way, but it assumes there is some kind of linear scale.
For signal compression (Shannon information) or algorithmic length (Kolmogorov
complexity), you can generate such a number and use it to compare different
messages or programs. I don’t know that that’s the case for the complexity of
the brain or the complexity of the genome.
I’m willing to say that a human brain is
more complex than that of a nematode and the human genome probably has correspondingly
more information in it than the nematode genome. But does the human genome have
more information than the mouse
genome or the chimp genome? Or is it just different
information – qualitatively, but not quantitatively distinct? Would the
Kolmogorov complexity differ between the mouse genome and the human genome? Not
much, I’d wager, and I’m not sure what you’d learn by quantifying it.
The real problem is that none of those
measures capture what the information means, because the meaning does not
inhere wholly in the message – it depends on who is reading it and what else
they know. For example, here is a (very) short story famously attributed to Ernest Hemingway:
For sale: Baby shoes, never worn.
It doesn’t have much Shannon information
and you could write a very brief program to reproduce it. But it’s freighted
with meaning for most readers.
What does it all mean?
So maybe we should be asking: what is the
meaning encoded in the genome and how is that meaning interpreted and decoded
and ultimately realised? The impressive thing is that the interpretation is
done by the products of the genome itself, starting with the proteins expressed
in the egg, and then with the proteins that very quickly come to be produced
from the zygotic genome.
The only physical property the genome
has to work with to encode anything is stickiness. People often say that DNA is
a chemically inert molecule that doesn’t do anything by itself and this is
accurate in the sense that it does not tend to chemically react with or form
covalent bonds with other molecules. But it is functional in a different way –
it is an adsorption catalyst. It is a surface that promotes the interaction of
other molecules by bringing them into close proximity.
The specific sequence of any stretch of DNA
determines which proteins will bind to it and with what affinity. If those
proteins are transcription factors or chromatin proteins then their binding may
regulate the expression of a nearby gene, as discussed above, by increasing or
decreasing the likelihood that the enzyme RNA polymerase will bind to the gene
and produce a messenger RNA molecule. And, in turn, the mRNA acts as an
adsorption catalyst, bringing transfer RNAs carrying specific amino acids into
adjacency so that a peptide bond can be formed between them, eventually forming
a specific protein. It all starts with stickiness.
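As a cartoon of that idea – with made-up motifs and numbers, and exact string matching standing in for what is really graded binding affinity – regulatory “stickiness” might be sketched like this:

```python
# A toy version of "stickiness as regulation": the sequence of a regulatory
# region determines which factors bind (here, by exact motif matching; real
# binding is graded by affinity), and bound activators or repressors shift the
# probability that RNA polymerase engages the gene. All names and numbers are
# invented for illustration.
ACTIVATOR_MOTIF = "TGACGT"
REPRESSOR_MOTIF = "AATTCC"

def transcription_probability(regulatory_seq: str, base_p: float = 0.1) -> float:
    p = base_p
    if ACTIVATOR_MOTIF in regulatory_seq:
        p *= 5        # a bound activator recruits polymerase more effectively
    if REPRESSOR_MOTIF in regulatory_seq:
        p *= 0.1      # a bound repressor blocks it
    return min(p, 1.0)

print(transcription_probability("GGTGACGTCA"))          # activator site only -> 0.5
print(transcription_probability("GGTGACGTCAAATTCC"))    # activator plus repressor -> 0.05
```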
Of course, what unfolds after that – after the maternally deposited proteins bind to the genome in the fertilised egg and the new zygote starts making its own proteins – seems almost miraculous. Those patterns of biochemical affinity embody a complex and dynamic set of feedback and feedforward loops, coordinating profiles of gene expression, breaking symmetries, driving cellular differentiation and patterning of the embryo. And the biochemical activities and affinities of the proteins produced then control morphogenesis, cell migration, and, in the developing nervous system, the extension and guidance of projections, and tendencies to make synaptic connections with other cell types.
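To give a flavour of what “feedback loops breaking symmetries” means in practice, here is a minimal sketch of a classic textbook motif – a mutual-repression toggle switch, not a model of any particular real circuit – in which two cells starting from almost identical states are pushed to two different stable fates:

```python
# A cartoon of symmetry-breaking by feedback: two genes, X and Y, each
# repressing the other (a generic toggle-switch motif). A tiny initial
# asymmetry is amplified into two distinct, stable expression states.
def simulate(x0, y0, steps=5000, dt=0.01, k=4.0, n=2):
    x, y = x0, y0
    for _ in range(steps):
        dx = k / (1 + y**n) - x     # X is produced, repressed by Y, and decays
        dy = k / (1 + x**n) - y     # Y is produced, repressed by X, and decays
        x += dt * dx
        y += dt * dy
    return round(x, 2), round(y, 2)

# Two "cells" starting almost identically, with a tiny asymmetry:
print(simulate(0.51, 0.50))   # settles with X high, Y low  (~3.7, ~0.3)
print(simulate(0.50, 0.51))   # settles with Y high, X low  (~0.3, ~3.7)
```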
So, the developing organism interprets its
own genome as it self-assembles. Pretty astounding, but that’s life
(literally). However, the ultimate interpreter of the meaning in the genome is
natural selection. The resultant organism has to be made with the right specifications,
within the right operating range so as to survive and reproduce, given the
right environmental conditions.
You don’t need to specify where every
synapse is to get the machine to work right. You just need to specify roughly
the numbers of different cell types, their relative positions, the other types
of neurons they tend to connect to, etc. The job of building the brain is
accomplished statistically, and, crucially, probabilistically. This is why
there is lots of variation in brain structure and function even between
monozygotic twins and why intrinsic developmental variation is such a crucial
(and overlooked) source of differences in people’s psychology and behaviour
(the subject of my book Innate).
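As a sketch of what “statistical and probabilistic” wiring rules might look like in the abstract – the cell types, counts, and probabilities below are all invented for illustration – the specification can be tiny even though the resulting wiring diagram is large and different every time:

```python
# A cartoon of probabilistic wiring rules: the "genome" here is just a small
# table of connection probabilities between (invented) cell types, not a list
# of individual synapses. Two runs of the same rules - like monozygotic twins -
# give different wiring diagrams with the same overall statistics.
import random

CELL_TYPES = {"excitatory": 80, "inhibitory": 15, "neuromodulatory": 5}   # cells per type (toy numbers)
P_CONNECT = {                                                             # hypothetical rule table
    ("excitatory", "excitatory"): 0.10,
    ("excitatory", "inhibitory"): 0.30,
    ("inhibitory", "excitatory"): 0.40,
    ("neuromodulatory", "excitatory"): 0.05,
}

def grow_brain(seed):
    rng = random.Random(seed)
    cells = [(cell_type, i) for cell_type, count in CELL_TYPES.items() for i in range(count)]
    return {(a, b) for a in cells for b in cells
            if a != b and rng.random() < P_CONNECT.get((a[0], b[0]), 0.0)}

twin1, twin2 = grow_brain(1), grow_brain(2)
print(len(twin1), len(twin2))                     # similar overall connection counts...
print(len(twin1 & twin2) / len(twin1 | twin2))    # ...but largely non-overlapping wiring
```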
Back to A.I.
Okay, so, where does this leave the
question we started with? Is there enough information in the genome to specify
lots of innate priors? Based on what I’ve said above, I don’t think the number
of genes in the genome places a limit on this or is even an informative number.
While I have some sympathy with efforts to define what Tony Zador refers to as
a “genomic bottleneck”, I’m not convinced that quantifying it will be in any
way straightforward or necessarily useful.
It’s certainly true that there isn’t enough
information in the genome to specify the precise outcome of neural development,
in terms of the number and position and connectivity of every neuron in the
brain. The genome only encodes a set of mindless biochemical rules that, when
played out across the dynamic self-organising system of the developing embryo,
lead to an outcome that is within a range of operational parameters defined by
natural selection. But there’s plenty of scope for those operational parameters
to include all kinds of things we would recognise as innate priors. And there
is plenty of evidence across many species that many different innate priors are
indeed pre-wired into the nervous system based on instructions in the genome.
For A.I., it still may be best to try and
keep such priors to a minimum, to make a general-purpose learning machine. On
the other hand, making a true A.I. – something that qualifies as an agent – may
require building something that has to do more than solve some specific
computational problems in splendid isolation. If it has to be embodied and get
around in the world, it may need as much help as it can get.
*[Note from Tony Zador: The length of
the genome, which in humans is around 3 billion letters, represents an upper
bound on the Kolmogorov complexity. Only a fraction of that carries functional
information, however, so the upper KC value may be quite a bit lower than
that].