How much innate knowledge can the genome encode?


In a recent debate between Gary Marcus and Yoshua Bengio about the future of Artificial Intelligence, the question came up of how much information the genome can encode. This relates to the idea of how much innate or prior “knowledge” human beings are really born with, versus what we learn through experience. This is a hot topic in AI these days as people debate how much prior knowledge needs to be pre-wired into AI systems, in order to get them to achieve something more akin to natural intelligence. 

Bengio (like Yann LeCun) argues for putting as little prior knowledge into the system as we can get away with – mainly in the form of meta-learning rules, rather than specific details about specific things in the environment – such that the system that emerges through deep learning from the data supplied to it will be maximally capable of generalisation. (In his view, more detailed priors give a more specialised, but also a more limited and possibly more biased, machine.) Marcus argues for more prior knowledge as a scaffold on which to efficiently build new learning. He points out that this is exactly how evolution has worked – packing that kind of knowledge into the genomes of different species.

Of course, the question is how that works. How does information in the genome lead to pre-wiring of the nervous system in a way that automatically connects external stimuli to internal values or associations?

In the course of a brief discussion on this point, Bengio argued that there is not, in fact, enough information in the genome to encode much prior knowledge: “…there’s lots of room in the genome but clearly not enough to encode the details of what your brain is doing. So, it has to be that learning is explaining the vast majority of the actual computation done in the brain, just by counting arguments: twenty thousand-odd genes, with a hundred billion neurons and a thousand times more connections”.

Marcus responded with a reference to his excellent book “The Birth of the Mind”, in which he refutes this “genome shortage argument” – the claim that there simply isn’t enough information in the genome to specify a lot of priors, which he suggests people take as implying that most of our knowledge must be learned.

Unfortunately, their discussion veered off this topic pretty quickly, but it’s a very interesting question and, in my view, the way that Bengio phrased it – as a simple numerical argument – completely misconceives the way in which the kind of information we are after is encoded in the genome.

He is not alone. This discussion harks back to an issue that exercised many people in the biological community when the Human Genome Project completed the first draft of the human genome and it turned out that we had far fewer genes than had previously been thought. Previous estimates had put humans at about 100,000 genes (each one coding for a different protein), based on extrapolation from a variety of kinds of data, the details of which are not really important. When it turned out we have only about 20,000 genes, there were gasps of horror, as this is not even twice as many as lowly fruit flies or even vastly simpler nematodes have – each has ~15,000 genes.

Nematodes only have ~1,000 cells, ffs! If we are so complex (and we are, if nothing else, stubbornly impressed with ourselves), how can we have only about the same number of genes as a crappy little worm that doesn’t even have a brain? And, as Bengio and others argue, if we have only this limited number of genes, how can we encode any kind of sophisticated priors in the connectivity of our brains, which is, mathematically speaking, vastly more complicated?


Counting the wrong things

To me, this kind of thinking misconceives entirely the way that developmental information is encoded in the genome. It’s not the number of genes that matters, it’s the way that you use them. From a developmental perspective, the number of proteins encoded is not the measure of complexity. The human genome encodes almost exactly the same set of proteins as every other mammal, with few meaningful differences in the actual amino acid sequences. In fact, the biochemical sequences and functions of many proteins are conserved over much larger evolutionary distances – from insects to mammals, for example.

As one famous experiment illustrates (and many others have confirmed), it is often possible to substitute the fly version of a protein with the mouse version and have it function perfectly normally. The experiment I’m thinking of looked at a gene called eyeless in flies (so named because mutant flies lacking the gene have no eyes) and Pax6 in mice. Eyeless/Pax6 is known as a “master regulator” gene – it encodes a transcription factor that regulates a cascade of other genes necessary for eye development, in both flies and mice.

If you drive expression of the Eyeless protein in parts of the developing fly that give rise to other appendages like antennae, legs, or wings, you can convert those appendages into eyes as well. Amazingly, if you express the mouse version, Pax6, in these tissues, the same thing happens – you form extra, ectopic eyes (fly eyes, not mouse eyes). So, the differences in amino acid sequence between the proteins encoded by eyeless and Pax6 don’t make much of a difference to the protein’s function. What does make a difference is the context in which it is expressed – the network of genes that the protein regulates in each species and the subsequent cascading developmental trajectory.
[Figure: Expression of eyeless in other tissues drives formation of ectopic eyes]


This relies not on the protein-coding sequences, but on the so-called regulatory sequences of the target genes. These are short stretches of DNA, often adjacent to the protein-coding bits, which act as binding sites for other proteins (transcription factors or chromatin regulators), which control when and where a protein is expressed – in which cell types and at what levels.

[Figure from Innate]
These regulatory elements typically work in a modular fashion and are much more evolvable than the sequences of the regulatory proteins. The latter are highly functionally constrained because each such protein regulates many targets and any change in its sequence will have many, diverse effects. (This is probably why they work the same even across distantly related species). By contrast, changing the binding site for one protein that regulates gene A will change only that aspect of the expression pattern of gene A which that protein regulates, leaving all other aspects and all other genes unchanged.

The complexity of those regulatory sequences is not at all captured by the number of genes. Indeed, it is not even captured by the number of such elements themselves, as they interact in combinatorial and highly context-dependent ways, in cooperation or competition with different sets of factors in different cell types and at different stages of development. (For example, Pax6 in mammals is also involved in patterning the developing neocortex and in specifying a number of different cell types in the developing spinal cord, in combination with other transcription factors). The complexity of the output thus should not be expected to scale linearly with the number of genetic elements.
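
To make the combinatorial point concrete, here is a purely illustrative toy in Python. The factor names and the regulatory rule are made up, not real biology; the point is only the counting: a handful of factors, read in combination, define exponentially many regulatory contexts, and a given regulatory element can respond to any particular combination of them.

```python
# Toy illustration only: factor names and the rule for "gene X" are invented.
from itertools import product

factors = ["TF_A", "TF_B", "TF_C", "TF_D", "TF_E"]   # hypothetical regulators

# Each combination of present/absent factors is a distinct regulatory context.
contexts = [frozenset(f for f, present in zip(factors, combo) if present)
            for combo in product([False, True], repeat=len(factors))]

# A hypothetical gene whose regulatory element responds to one particular
# combination: "on" only if TF_A and TF_C are bound and TF_E is not.
def gene_x_expressed(context):
    return "TF_A" in context and "TF_C" in context and "TF_E" not in context

on_contexts = [c for c in contexts if gene_x_expressed(c)]
print(f"{len(factors)} factors -> {len(contexts)} possible contexts")  # 5 -> 32
print(f"gene X is on in {len(on_contexts)} of them")                   # 4
# With 20 factors there are over a million contexts; the combinatorics,
# not the raw gene count, is where the capacity for complexity lives.
```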

Being amazed that you can make a human with only 20,000 genes is thus like being amazed that Shakespeare could write all those plays with only 26 letters. It’s totally missing where the actual, meaningful information is and how it is decoded.


How do we measure information?

Part of the problem is in thinking of Shannon information as the only measure of information content that is really scientific (nicely described in James Gleick’s “The Information”). Shannon information (named after Claude Shannon) simply relates to the efficiency of encoding information for signal transmission. It is measured in bits, which correspond to how many “yes or no” questions you would have to ask to derive the original message that you want to transmit.

Importantly, this does not take into account at all what the message means. Indeed, a purely random sequence of letters has greater Shannon information than a string of words that make up a sentence, because the words and the sentence have some higher-order patterns in them (like statistics of letters that typically follow each other, such as a “u” following a “q”), which can be used to compress the message. A random sequence has no such patterns and thus cannot be compressed.
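
A quick way to see this is to compute the empirical Shannon entropy of two strings, one meaningful and one random. The sketch below is just an illustration of the definition (entropy in bits per character, based only on letter frequencies, ignoring the higher-order patterns mentioned above) and has nothing specific to do with genomes:

```python
import math
import random
import string
from collections import Counter

def entropy_bits_per_char(text):
    """Empirical Shannon entropy of a string, in bits per character."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

sentence = "the quick brown fox jumps over the lazy dog " * 30
alphabet = string.ascii_lowercase + " "
random.seed(0)
gibberish = "".join(random.choice(alphabet) for _ in range(len(sentence)))

# The meaningless random string carries MORE Shannon information per character
# than the English text, because English letter frequencies are skewed and
# therefore compressible.
print(f"English: {entropy_bits_per_char(sentence):.2f} bits/char")
print(f"Random:  {entropy_bits_per_char(gibberish):.2f} bits/char")  # ~log2(27) = 4.75
```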

Thinking in those terms naturally leads to the kind of “counting arguments” that Bengio makes. These seem to take each gene as a bit of information, and ask whether there are enough such bits to specify all the bits in the brain, usually taken as the number of connections. Obviously the answer is there are not enough such bits. (There aren’t even enough bits if you take individual bases of the genome as your units of information).
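
For what it’s worth, the arithmetic behind that counting argument is easy to sketch. The figures below are just the rough, order-of-magnitude numbers quoted in the discussion above (about 3 billion bases, 20,000 genes, a hundred billion neurons, a thousand connections per neuron), not precise measurements:

```python
# Back-of-the-envelope version of the counting argument (illustrative only).

genome_bases = 3e9            # ~3 billion base pairs
bits_per_base = 2             # four possible bases -> 2 bits each
genome_bits = genome_bases * bits_per_base       # ~6e9 bits (~750 megabytes)

genes = 2e4                   # ~20,000 protein-coding genes
neurons = 1e11                # ~100 billion neurons
connections = neurons * 1e3   # ~a thousand connections per neuron -> ~1e14

# Even granting a full bit of "specification" per base (let alone per gene),
# the genome falls short of the number of connections by orders of magnitude.
print(f"bits in the genome:       ~{genome_bits:.0e}")
print(f"connections in the brain: ~{connections:.0e}")
print(f"shortfall:                ~{connections / genome_bits:.0f}x")
```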

But the genome is not a blueprint. Bits of the genome do not correspond to bits of the body (or bits of the brain). The genome is much more like a program or an algorithm. And computer scientists have a much better measure to get at how complex that program is, which is known as algorithmic complexity (or Kolmogorov complexity). This is the length of the shortest computer program that can produce the output in question.
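
As a toy illustration of that distinction, a very short program can generate output enormously larger and seemingly more intricate than itself; the Kolmogorov complexity of the output is bounded by the length of the rule, not by the size of what unfolds from it. (The cellular automaton below is just a convenient example and has nothing to do with genomes.)

```python
# A tiny program whose output is vastly longer than the program itself: the
# algorithmic complexity of the pattern is bounded by these few lines of rules.

def rule90(width=64, steps=32):
    """Elementary cellular automaton: each new cell is the XOR of its two
    neighbours. A one-line rule unfolds into a Sierpinski-like pattern far
    larger than the rule that generated it."""
    row = [0] * width
    row[width // 2] = 1
    lines = []
    for _ in range(steps):
        lines.append("".join("#" if c else "." for c in row))
        row = [row[i - 1] ^ row[(i + 1) % width] for i in range(width)]
    return "\n".join(lines)

pattern = rule90()
print(pattern)
print(len(pattern), "characters of output from a ~10-line rule")
```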

How complex is your code?

What we really would like to know is the Kolmogorov complexity of the human brain – how complex does the developmental program have to be to produce it? (More specifically, to get it to self-assemble, given the right starting conditions in a fertilised egg). And, most germane for this discussion, how complex would that program have to be to specify lots of innate priors?

Speaking as a developmental neurobiologist, if we knew the answers to those questions, we’d be done. We can’t figure out or quantify how complex the developmental program is until we know what the developmental program is (*but see note from Tony Zador, below). In fairness, we know a lot – the field is in some ways quite mature, at least in terms of major principles by which the brain self-assembles based on genomic instructions. But I think it’s also fair to say that we do not have a complete understanding of the molecular logic underlying the guidance of the projections and the formation of synaptic connections of even a single neuron (like this one, for example) in any species. What we can say is that that logic is highly combinatorial, meaning you can get a lot of complexity out of a limited set of molecules.

And we are only beginning to understand how instinct and innate preferences and behaviors are wired into the circuitry of the brain. There are growing numbers of examples of subtle genetic differences that lead to specific differences in neural circuitry that explain differences in behaviour. Some of these relate to differences in behaviour between closely related species (such as monogamy vs polygamy, or geometry of tunnel building). But probably the best-studied are sex differences within species, where a subtle genetic difference between males and females (the activity of the Sxl gene in flies or the presence of the SRY gene in mammals, for example) shifts the developmental trajectory of certain circuits, and pre-wires preferences and behaviours.

It’s an exciting time for this kind of research, with lots of amazing new tools being focused on fascinating biological questions in all kinds of species, but really this work is only just beginning. Moreover, it typically focuses on how a difference in the genome leads to a difference in circuitry and innate behaviour. It doesn’t explain the full program required to specify all of the circuitry on which any such behaviour is built.

So, we simply can’t currently answer the question of how complex the developmental program is and how much innate knowledge or behaviour it could really encode, because we just don’t know enough about how the program works or how innate knowledge and behaviour are wired into the brain. But I think we know enough to know that the number of genes is not the number we’re after.

In fact, I’m not sure there is a number that could capture what we’re after. I understand the urge to quantify the information in the genome in some way, but doing so assumes there is some kind of linear scale. For signal compression (Shannon information) or algorithmic length (Kolmogorov complexity), you can generate such a number and use it to compare different messages or programs. I don’t know that that’s the case for the complexity of the brain or the complexity of the genome.

I’m willing to say that a human brain is more complex than that of a nematode and the human genome probably has correspondingly more information in it than the nematode genome. But does the human genome have more information than the mouse genome or the chimp genome? Or is it just different information – qualitatively, but not quantitatively distinct? Would the Kolmogorov complexity differ between the mouse genome and the human genome? Not much, I’d wager, and I’m not sure what you’d learn by quantifying it.

The real problem is that none of those measures capture what the information means, because the meaning does not inhere wholly in the message – it depends on who is reading it and what else they know. For example, here is a (very) short story written by Ernest Hemingway:

For sale: Baby shoes, never worn.

It doesn’t have much Shannon information and you could write a very brief program to reproduce it. But it’s freighted with meaning for most readers.


What does it all mean?

So maybe we should be asking: what is the meaning encoded in the genome and how is that meaning interpreted and decoded and ultimately realised? The impressive thing is that the interpretation is done by the products of the genome itself, starting with the proteins expressed in the egg, and then with the proteins that very quickly come to be produced from the zygotic genome.

The only physical property the genome has to work with to encode anything is stickiness. People often say that DNA is a chemically inert molecule that doesn’t do anything by itself, and this is accurate in the sense that it does not tend to chemically react with or form covalent bonds with other molecules. But it is functional in a different way – it is an adsorption catalyst. It is a surface that promotes the interaction of other molecules by bringing them into close proximity.

The specific sequence of any stretch of DNA determines which proteins will bind to it and with what affinity. If those proteins are transcription factors or chromatin proteins then their binding may regulate the expression of a nearby gene, as discussed above, by increasing or decreasing the likelihood that the enzyme RNA polymerase will bind to the gene and produce a messenger RNA molecule. And, in turn, the mRNA acts as an adsorption catalyst, bringing transfer RNAs carrying specific amino acids into adjacency so that a peptide bond can be formed between them, eventually forming a specific protein. It all starts with stickiness.
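
For readers who like to see that decoding chain laid out, here is a deliberately stripped-down sketch of it in code: complementary base-pairing (the “stickiness”) turns a template DNA strand into mRNA, and the mRNA is then read in triplets to string amino acids together. Only a few codons of the real genetic code are included, and strand orientation and all of the actual molecular machinery are glossed over.

```python
# Minimal sketch of the decoding chain: DNA -> mRNA -> protein.
# Partial codon table only (the real genetic code has 64 codons);
# strand orientation and the actual machinery are omitted.

CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "AAA": "Lys", "UAA": "STOP",
}

def transcribe(dna_template):
    """Template DNA strand -> mRNA by complementary base-pairing (T -> A, A -> U...)."""
    pairs = {"A": "U", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in dna_template)

def translate(mrna):
    """Read the mRNA in codons (triplets) until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "???")
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein

mrna = transcribe("TACAAACCGTTTATT")   # template strand for Met-Phe-Gly-Lys
print(mrna)                            # AUGUUUGGCAAAUAA
print(translate(mrna))                 # ['Met', 'Phe', 'Gly', 'Lys']
```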

Of course, what unfolds after that – after the maternally deposited proteins bind to the genome in the fertilised egg and the new zygote starts making its own proteins – seems almost miraculous. Those patterns of biochemical affinity embody a complex and dynamic set of feedback and feedforward loops, coordinating profiles of gene expression, breaking symmetries, driving cellular differentiation and patterning of the embryo. And the biochemical activities and affinities of the proteins produced then control morphogenesis, cell migration, and, in the developing nervous system, the extension and guidance of projections, and tendencies to make synaptic connections with other cell types.

So, the developing organism interprets its own genome as it self-assembles. Pretty astounding, but that’s life (literally). However, the ultimate interpreter of the meaning in the genome is natural selection. The resultant organism has to be made with the right specifications, within the right operating range so as to survive and reproduce, given the right environmental conditions.

You don’t need to specify where every synapse is to get the machine to work right. You just need to specify roughly the numbers of different cell types, their relative positions, the other types of neurons they tend to connect to, etc. The job of building the brain is accomplished statistically, and, crucially, probabilistically. This is why there is lots of variation in brain structure and function even between monozygotic twins and why intrinsic developmental variation is such a crucial (and overlooked) source of differences in people’s psychology and behaviour (the subject of my book Innate). 
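
Here is a minimal sketch of what “statistical and probabilistic” specification means, with invented cell types and probabilities: the fixed “rules” are just a small table of connection probabilities between cell types, yet running the same rules twice, as for a pair of identical twins, yields circuits that match in their statistics but differ in their detailed wiring.

```python
# Toy sketch only: cell types and probabilities are invented for illustration.
import random

connection_prob = {            # hypothetical rule table: P(connect) by type pair
    ("excitatory", "excitatory"): 0.10,
    ("excitatory", "inhibitory"): 0.30,
    ("inhibitory", "excitatory"): 0.40,
    ("inhibitory", "inhibitory"): 0.05,
}

def grow_circuit(n_exc=80, n_inh=20, seed=None):
    """Wire up a circuit from the probability table; details vary per run."""
    rng = random.Random(seed)
    cells = ["excitatory"] * n_exc + ["inhibitory"] * n_inh
    synapses = set()
    for i, pre in enumerate(cells):
        for j, post in enumerate(cells):
            if i != j and rng.random() < connection_prob[(pre, post)]:
                synapses.add((i, j))
    return synapses

twin_a, twin_b = grow_circuit(seed=1), grow_circuit(seed=2)
# Similar total numbers of synapses, but largely non-overlapping sets of them.
print(len(twin_a), len(twin_b), len(twin_a & twin_b))
```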


Back to A.I.

Okay, so, where does this leave the question we started with? Is there enough information in the genome to specify lots of innate priors? Based on what I’ve said above, I don’t think the number of genes in the genome places a limit on this or is even an informative number. While I have some sympathy with efforts to define what Tony Zador refers to as a “genomic bottleneck”, I’m not convinced that quantifying it will be in any way straightforward or necessarily useful.
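
To give a flavour of what the “genomic bottleneck” idea is getting at (this is a loose stand-in for illustration, not Zador’s actual model), imagine that the mature connectivity is a big matrix that is never stored directly but is regenerated from a much smaller description; here a crude low-rank factorisation plays the role of the compact developmental rules.

```python
# Loose illustration only: a low-rank factorisation standing in for the idea
# that detailed wiring is regenerated from a far smaller "genomic" description.
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Pretend this is the mature connectivity we want to end up with: mostly
# idiosyncratic detail, plus some broad structure shared across individuals.
structure = 5 * np.outer(rng.standard_normal(n), rng.standard_normal(n))
wiring = structure + rng.standard_normal((n, n))

# The "genome" keeps only a rank-k summary instead of all n*n entries.
k = 10
U, s, Vt = np.linalg.svd(wiring, full_matrices=False)
regrown = (U[:, :k] * s[:k]) @ Vt[:k, :]

full_params = n * n                      # 1,000,000 numbers
bottleneck_params = k * (2 * n + 1)      # 20,010 numbers
print(f"compression: {bottleneck_params / full_params:.1%} of the full wiring")
print(f"structured part recovered: r = "
      f"{np.corrcoef(structure.ravel(), regrown.ravel())[0, 1]:.2f}")
```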

It’s certainly true that there isn’t enough information in the genome to specify the precise outcome of neural development, in terms of the number and position and connectivity of every neuron in the brain. The genome only encodes a set of mindless biochemical rules that, when played out across the dynamic self-organising system of the developing embryo, lead to an outcome that is within a range of operational parameters defined by natural selection. But there’s plenty of scope for those operational parameters to include all kinds of things we would recognise as innate priors. And there is plenty of evidence across many species that many different innate priors are indeed pre-wired into the nervous system based on instructions in the genome.

For A.I., it still may be best to try and keep such priors to a minimum, to make a general-purpose learning machine. On the other hand, making a true A.I. – something that qualifies as an agent – may require building something that has to do more than solve some specific computational problems in splendid isolation. If it has to be embodied and get around in the world, it may need as much help as it can get.


*[Note from Tony Zador: The length of the genome, which in humans is around 3 billion letters, represents an upper bound on the Kolmogorov complexity. Only a fraction of that carries functional information, however, so the true value may be quite a bit lower than that upper bound.]
