
DNA is a reality beyond metaphor
David Baltimore
The drumbeats
get louder as we approach the day when the first draft of the entire structure
of the human genome is to be announced. Pundits appear on television shows,
trying to tell the public what this means. Many are my good friends. But
I must tell their dirty little secret.
They are
not telling the whole story. They have all decided that the real meaning
of this achievement is so wrapped up in technical detail that the only
way to convey the truth is through metaphor. So they tell the world that
the genome is like a book, with words, sentences and chapters. Or they
say that it is the periodic table for biologists, assuming that laypeople
will have heard of this key organizing principle of chemistry. But these
and other metaphoric links miss the real story. The genome is like no
other object that science has elucidated. No mere tool devised by humans
has the complexity of representation found in the genome.
So let me
try the harder, but richer and more honest, approach. Let me try to explain
what the genome really is. To do that, I need to start at the chemical
level and then take on the more complex notion of coding. Structurally,
the genome is just a huge string of chemical units broken up quite arbitrarily
into anywhere from 3 to a few hundred individual packages, the chromosomes.
This one-dimensional string of linked chemical units is abbreviated DNA,
and we need not worry about its detailed architecture. We can also ignore
the packaging and treat DNA as one long string of 3 billion units. When
we say that the genome has been sequenced, we mean that we know the chemical
composition of each of the units as they occur in the sequence. Actually,
at each of the 3 billion positions along the string there can be only
one of four chemical units, abbreviated as A, T, C and G. So the genome
is a string of these four units in some particular sequence that goes
on for 3 billion letters. The closest analogy is to computer code, which
is a gigantic string with only 2 possible letters at each position.
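
To make the comparison with computer code concrete, here is a minimal Python sketch. It assumes an arbitrary assignment of two binary digits to each of the four letters; the function names and the particular assignment are illustrative choices, not a standard. It shows that a four-letter string can be rewritten losslessly as a two-letter one, and that 3 billion letters at two bits each come to about 750 megabytes.

```python
# Map each of the four DNA letters to a 2-bit code.
# The particular assignment is arbitrary and chosen only for illustration.
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}

GENOME_LENGTH = 3_000_000_000  # the essay's figure of 3 billion units


def dna_to_bits(dna: str) -> str:
    """Rewrite a DNA string as a string of 0s and 1s, two bits per letter."""
    return "".join(BASE_TO_BITS[base] for base in dna)


def bits_to_dna(bits: str) -> str:
    """Invert dna_to_bits by reading the bit string two characters at a time."""
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def genome_size_megabytes(n_bases: int = GENOME_LENGTH) -> float:
    """Size of n_bases letters at 2 bits each, in megabytes (10**6 bytes)."""
    return n_bases * 2 / 8 / 1_000_000


print(dna_to_bits("GATTACA"))      # 10001111000100
print(genome_size_megabytes())     # 750.0
```

The sketch captures only the raw string. It says nothing about how the string is read, which is where the computer metaphor starts to fall short.
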
Why not then
be satisfied with this computer code metaphor? Mainly because the meaning
of a computer code is not common knowledge so the metaphor conveys very
little to many people. But also because the metaphor does not communicate
the richness of coding systems buried in the seemingly monotonous string
of letters.
Coding implies
a method of transforming the coded information into some useful form.
The computer in front of me now is doing this so effectively that I never
see the code. Similarly, the cells of the body can decode DNA so effectively
that until the 1950s no one knew that there were codes controlling
living systems. The remarkable thing about the DNA code is that it is
decoded in multiple ways, all interdigitated with each other in the string
of letters in DNA. The most discussed are coding regions that specify
the sequence of proteins. Proteins are the actual machines that do the
work of the body. The protein-coding regions are all that is captured
by most metaphors for DNA. DNA as a book implies that all DNA has is
letters that transform into words, the meaningful units of language, and
that words are like proteins. But the regions of human DNA that encode
proteins are only a few percent of the 3 billion-long string of letters.
Most of it does other things. What are these other things?
The DNA code
can specify the sites and nature of many different events. While we don't
know them all, there are easily tens of others aside from the sequence
of proteins. For instance, DNA does not encode proteins directly; it uses
an intermediary chemical string called RNA. Each RNA encodes one protein,
so an RNA is a form of packaging of the DNA string into meaningful,
bite-sized pieces. But then the DNA must have a code for where to start an RNA,
and where to end an RNA. The RNA is not used as a direct copy of DNA but
rather is processed by destroying parts of it, modifying other parts and
putting special structures at each end. There is code for each of these
events.
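
The chain from DNA to RNA to protein can be sketched in a few lines of Python. This is a toy model under stated assumptions: the DNA string and the coordinates of the discarded piece are invented for illustration, the codon table is only a five-entry subset of the standard genetic code, and `splice` stands in for just one of the processing events described above (destroying parts of the RNA), ignoring the modifications and end structures. The start and stop signals used here are the ones for reading an RNA into protein, which are a different layer of code from the signals that say where an RNA begins and ends.

```python
from typing import List, Tuple

STOP = "STOP"

# A small subset of the standard genetic code, enough for the toy example.
CODON_TABLE = {
    "AUG": "Met",  # also the usual start signal for reading an RNA
    "GCU": "Ala",
    "UGG": "Trp",
    "AAA": "Lys",
    "UAA": STOP,   # one of the stop signals
}


def transcribe(dna: str) -> str:
    """Copy the coding strand of DNA into RNA: same letters, with U in place of T."""
    return dna.replace("T", "U")


def splice(rna: str, discard: List[Tuple[int, int]]) -> str:
    """Remove the given (start, end) slices, a stand-in for RNA processing."""
    kept = []
    position = 0
    for start, end in sorted(discard):
        kept.append(rna[position:start])
        position = end
    kept.append(rna[position:])
    return "".join(kept)


def translate(rna: str) -> List[str]:
    """Read three letters at a time from the first AUG until a stop signal."""
    start = rna.find("AUG")
    if start == -1:
        return []
    protein = []
    for i in range(start, len(rna) - 2, 3):
        amino_acid = CODON_TABLE.get(rna[i:i + 3], "?")  # "?" = not in the toy table
        if amino_acid == STOP:
            break
        protein.append(amino_acid)
    return protein


# Hypothetical 26-letter gene: leader, coding piece, discarded piece, coding piece, trailer.
dna = "CC" + "ATGGCT" + "GTCCCAG" + "TGGAAATAA" + "GG"
rna = splice(transcribe(dna), discard=[(8, 15)])
print(rna)             # CCAUGGCUUGGAAAUAAGG
print(translate(rna))  # ['Met', 'Ala', 'Trp', 'Lys']
```

Even in this toy, three separate kinds of information sit in one string: the letters that become protein, the boundaries of the piece to discard, and the start and stop signals.
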
What reads
these codes for processing RNA? Protein machines do it: they interpret
the coded sequence and follow its instructions. Each modifying machine
carries its own decoders. In some cases they even carry very specific
pieces of RNA because the code is most easily read by another RNA. There
are from a few to many tens of processing codes associated with each gene
in the genome.
But that
is only the beginning of the story. Another whole family of decoders determines
which RNA is made in which cell of the body and in what amount. This is
probably the most important code of all because it is what specifies the
individual functions of the cells of the body. The red cells of our blood
carry oxygen to our tissues because the gene for the oxygen carrier, hemoglobin,
is copied into a huge amount of RNA in developing red cells. But it is
copied in no other cell because no other cell needs that protein. Thus,
there is machinery in a red blood cell and only there that can read the
code surrounding the hemoglobin gene. There are some 50,000 genes in the
genome and that story can be told for each one. Some make RNA in all the
cells of the body because they are "housekeeping genes". Others
are found in some types of cells but not others. The brain has probably
hundreds of different cell types and each must differ from the others
by the pattern of RNAs found in the cell.
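
One way to picture this is as a lookup table from genes to the cell types in which they are copied into RNA. The Python sketch below is purely illustrative: the three cell types, the gene labels other than hemoglobin, and the function names are all hypothetical, and a real table would also record amounts rather than a simple yes or no.

```python
# Hypothetical toy table: gene label -> set of cell types in which it is copied into RNA.
ALL_CELL_TYPES = {"developing red cell", "neuron", "liver cell"}

MAKES_RNA_IN = {
    "hemoglobin": {"developing red cell"},     # made in developing red cells only
    "housekeeping_gene": set(ALL_CELL_TYPES),  # made in every cell type
    "neuron_only_gene": {"neuron"},            # invented example of a cell-specific gene
}


def rna_pattern(cell_type: str) -> list:
    """Return the sorted list of genes that make RNA in the given cell type."""
    return sorted(gene for gene, cells in MAKES_RNA_IN.items() if cell_type in cells)


def is_housekeeping(gene: str) -> bool:
    """A gene counts as housekeeping here if it makes RNA in every cell type."""
    return MAKES_RNA_IN[gene] == ALL_CELL_TYPES


print(rna_pattern("developing red cell"))  # ['hemoglobin', 'housekeeping_gene']
print(rna_pattern("neuron"))               # ['housekeeping_gene', 'neuron_only_gene']
print(is_housekeeping("hemoglobin"))       # False
```

In this picture, two cell types differ exactly when their patterns differ.
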
We call the
code that determines which genes make which RNAs in which cells the regulatory
information in the DNA. There is lots of it, but it is hard to recognize,
and deciphering it is a particular challenge for those trying to read the
codes in DNA.
Then there
are many other codes in our DNA. As we grow, the number of cells in the body
increases by one cell dividing into two. When such cell division occurs,
it is crucial that each cell get a full complement of the DNA instructions.
So the DNA needs code to specify its own partition among the daughter
cells. Also, the DNA must duplicate itself. DNA is a double helix that
separates into its two constituent parts when it duplicates. A big protein
machine carries out that process and responds to coded information in
the DNA itself.
Basically,
DNA directs everything that happens in the cell and does it with a daunting
variety of interdigitated coded information. The sequence of the genome
lays bare all of that but only the continuing efforts of the thousands
of scientists trying to decipher the individual codes will bring out the
full richness of information hidden away in the string of units provided
by the sequencers. And, to make it all much harder, the meaningful coded
information is tucked away in a sea of parasitic DNA. Just as plants and
animals are infested with parasites, so is DNA. The parasites are themselves
stretches of DNA that can duplicate themselves and insert throughout the DNA. They,
and other forms of DNA that are thought to be accidental junk, form the
vast majority of our genome. We would love to ignore it all but often
it is hard to tell important DNA from junk and there is always the suspicion
that we may have underestimated the importance of what we call junk.
Modern biology
is a science of information. The sequencing of the genome is a landmark
of progress in specifying the information, decoding it into its many coded
meanings and learning how it goes wrong in disease. While it is a moment
worthy of the attention of every human, we should not mistake progress
for a solution. There is yet much hard work to be done; even the genome
we have today is a first draft that needs elaboration. It will be the
work of at least the next half-century to fully comprehend the magnificence
of the DNA edifice built over 4 billion years of evolution and held in
the nucleus of each cell of the body of each organism on earth.