Re: pure chance

Brian D. Harper (harper.10@osu.edu)
Thu, 09 Jan 1997 14:19:40 -0500

At 10:02 PM 1/6/97 +0000, Glenn wrote:

[...]

>
>I am going to throw another means of compression which apparently is rare but
>has apparently been observed. Overlapping genes! (Edward Yoxen, The Gene
>Business, Oxford U. Press, 1983, p. 107) says that some viruses have overlapping
>genes. This is truly a compressed message and I am not sure how information
>theory could deal with this. An audio example of overlapping messages is;
>
>"The seamlessness of speech is also apparent in
>'oronyms,' strings of sound that can be carved into words in two different
>ways:
>
>The good can decay many ways.
>The good candy came anyways.
>
>The stuffy nose can lead to problems
>The stuff he knows can lead to problems
>
>Some others I've seen.
>Some mothers I've seen."
>
>~Steven Pinker, The Language Instinct, (New York: Harper/Perennial, 1994), p.
>159-160
>
>One string of sounds but two messages. Information theory has no way that I
>am aware of for describing this type of situation. Do you know of one?
>

Glenn keeps coming up with these interesting cases :).

I'm starting to wonder if I should drop out of discussions
of information theory altogether. Every time I say
something it seems like I have to take it back a few
days later :). I looked up overlapping genes in Yockey's
book and found a lot of interesting stuff. But reading
through this interesting stuff I found that I had made
a mistake in my last post. I'm going to have to
think about it some more but it seems I'll probably
have to retract what I said previously, at least kind of.
Of course, this is probably going to get Greg really mad
at me :).

Greg has been asking about the usefulness of info-theory.
It looks like overlapping genes may be an example where
information theory is useful. Before getting into that
let me go over some of my musings regarding my last
post. To refresh our memories we had the situation where
several codons specify the same amino acid regardless of
the third position. I had taken this as an example of
the type of thing Greg was talking about, since a mutation
at one of these positions results in no change in the
information content of the protein. Glenn then asked whether
the information content in the DNA might change even
though the info-content of the protein remained the same.
The answer to this, after my further reading, is apparently
yes. So good job to Glenn for asking the right question.

I got myself muddled by getting off into sources and receivers,
so let's go back over that business a little. One thing that
I hope is obvious is that one cannot receive more information
than is sent. One can however send more information than
is received. I was aware of this previously but wasn't
aware of how crucial the point is for the genetic info
system. I was thinking that the main reason for information
loss was noise, various types of mistakes in encoding and
decoding. It turns out that a lot of information will also
be lost if the source code has a larger alphabet than the
receiving alphabet (which is the case for 61 codons in
DNA and 20 amino acids). This turns out to be crucial since
it is the information loss which guarantees that information
is transferred in only one direction, from DNA --> protein
but never protein --> DNA. We might then consider this a
useful contribution of information theory in that this
important principle of the genetic information system is
a direct consequence of the mathematics of coding theory.
Information transfer will always be unidirectional when
the entropy of the source exceeds the entropy of the
receiver.

I was previously thinking some along these lines but
messed myself up with the following considerations.
For the case of the redundant codons, a mutation in
the third position can either increase or decrease the
information content of the source. The information
content at the receiver remains the same in either
case. Thus, I was anticipating a case where the information
content at the source might drop below that at the receiver.
I'm thinking now that this possibility cannot occur due
to the large difference in information content between
the source (with 61 "letters") and the receiver (with
20 "letters"). A common mistake at this point is to
forget that unequal frequencies of the occurrence of
the letters decrease the information content. Nevertheless,
we can still get a rough comparison between DNA and
protein info content by taking the maximum entropy
which always occurs when all letters appear at the
same frequency. For this case, the entropy of a sequence
of N characters is:

S = -N * sum[i=1..P] (1/P)*Log2(1/P) = N*Log2(P)

where P is the number of characters in the alphabet (61 for
DNA, taking the sense codons as the letters, and 20 for
protein). Thus, the max entropy is 5.93*N bits for DNA and
4.32*N bits for protein, i.e. 5.93 and 4.32 bits per symbol.
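
As a quick check on the arithmetic, here is a minimal Python sketch
of the equal-frequency case. The function name max_entropy_bits is
just an illustrative choice, and the calculation assumes all letters
are equally likely, so it gives an upper bound rather than the
entropy of any real sequence.

```python
import math

def max_entropy_bits(alphabet_size: int, length: int = 1) -> float:
    """Maximum entropy in bits of `length` symbols drawn with equal
    probability from `alphabet_size` letters, i.e. N * log2(P)."""
    return length * math.log2(alphabet_size)

codon_bits = max_entropy_bits(61)  # 61 sense codons as the DNA "letters"
amino_bits = max_entropy_bits(20)  # 20 amino acids as the protein "letters"

print(f"codons:      {codon_bits:.2f} bits per symbol")
print(f"amino acids: {amino_bits:.2f} bits per symbol")
print(f"gap:         {codon_bits - amino_bits:.2f} bits per symbol")
```

The gap between the two numbers is the most that could be lost per
symbol purely because the source alphabet is larger.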

Now for my retraction. Given the authors were discussing the
entropy at the source, I think their conclusions and math
are correct. The comments I made were, I think, correct also
except I was (unknowingly) discussing a different entropy,
the so-called mutual entropy, which is a measure
of the information being passed through a communication
system. The mutual entropy is defined as:

I(A;B) = H(x) - H(x|y)

Where I is the mutual entropy. A represents the source
alphabet with an ensemble of messages x. A similar
interpretation is given to B and y. H(x) is the entropy
of the source (what the authors of the study were analyzing).
H(x|y) is the conditional entropy that message xi was sent
given message yj was received. Basically, H(x|y) is a
measure of the information lost during the transmission
of the message. H(x|y) can be further split into two
components, one representing the information lost due
to noise and the other the information lost due to the
source having a larger alphabet than the receiver.
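
To make the definition concrete, here is a rough Python sketch that
evaluates I(A;B) = H(x) - H(x|y) for the standard genetic code under
two simplifying assumptions that are mine, not Yockey's: all 61 sense
codons are equally likely, and there is no noise. Under those
assumptions H(x|y) contains only the alphabet-size component of the
loss. The names CODE, sense and degeneracy are illustrative.

```python
import math
from collections import Counter

BASES = "TCAG"
# Standard genetic code, codons in the order TTT, TTC, TTA, TTG, TCT, ...
# '*' marks the three stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

# Keep only the sense codons and count synonyms per amino acid.
sense = {codon: aa for codon, aa in CODE.items() if aa != "*"}
degeneracy = Counter(sense.values())
n = len(sense)

# H(x): entropy of the source with equally likely codons (assumption).
h_x = math.log2(n)

# H(x|y): with no noise, knowing the amino acid y leaves the codon
# uniformly distributed over its degeneracy[y] synonyms.
h_x_given_y = sum((d / n) * math.log2(d) for d in degeneracy.values())

mutual = h_x - h_x_given_y

# Cross-check: for a noiseless many-to-one code, I(A;B) equals H(y).
h_y = -sum((d / n) * math.log2(d / n) for d in degeneracy.values())

print(f"H(x)   = {h_x:.3f} bits")
print(f"H(x|y) = {h_x_given_y:.3f} bits")
print(f"I(A;B) = {mutual:.3f} bits, H(y) = {h_y:.3f} bits")
```

With real codon usage frequencies or a nonzero error rate the numbers
would change, but the bookkeeping stays the same.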

Sorry about the confusion. I personally have learned a
great deal from this exercise. Let's continue on
with these ideas to see some further useful insights
from info-theory.

First of all, the redundancy in the genetic code that
results from the source having a larger alphabet than
the receiver is often referred to in a pejorative sense
as degeneracy. But we see from the above that this
redundancy is necessary to guarantee a unidirectional
transfer of information. Yockey also points out that
redundancy is crucial for error correction.

Here's where one important implication of overlapping
genes comes into play. Yockey discusses this in detail,
mentioning that the phenomenon is much more widespread
than was originally thought. He also mentioned that
the initial discovery of overlapping genes caused
considerable "angst" among molecular biologists to the
extent that one fellow even called for a redefinition of
the "... very concept of a gene."

This is really interesting. We see that there are certain
advantages to be gained by redundancy in the genetic code.
A disadvantage though is that some information must be
lost. But overlapping genes overcome, at least partially,
this disadvantage by increasing the information that can
be transferred.
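
A toy illustration of the idea, in Python: the same stretch of DNA
read in three different frames gives three different amino acid
strings. The 15-base sequence below is made up for the example and is
not from any real virus, and the helper name translate is
hypothetical. The table is the standard genetic code.

```python
BASES = "TCAG"
# Standard genetic code, codons in the order TTT, TTC, TTA, TTG, TCT, ...
# '*' marks the three stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def translate(dna: str, frame: int = 0) -> str:
    """Translate `dna` starting at offset `frame` (0, 1 or 2),
    ignoring any incomplete codon at the end."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODE[codon] for codon in codons)

dna = "ATGGCTTGGCCATGC"  # invented sequence, for illustration only

for frame in range(3):
    print(f"frame {frame}: {translate(dna, frame)}")
```

One string of bases, three different messages, much like the oronyms
in Glenn's quote.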

Following is an extended quote from the epilogue of
Yockey's book (p. 339) which explains this stuff
probably much better than my rantings:

========= begin Yockey =================================

The fundamental axiom in molecular biology, which justifies
the application of information and coding theory, is the
_sequence hypothesis_. Although the sequence hypothesis is
peculiar to living systems, I have given considerable attention
to showing that the Central Dogma is a theorem from coding
theory and is not peculiar to biology (Yockey, 1974, 1978).
The fact that there must be a code between two sequences
with different alphabets, if information is to be transmitted
between them, follows from that axiom. There is no
_elan vital_ in biology so the genetic logic system must
obey the same fundamental mathematical and physical laws
as other logic operations. To believe otherwise is vitalism.
The irreversibility of the genetic logic operation is
unavoidable for two reasons. First, it is a consequence
of thermodynamic irreversibility that occurs when the
transition function does not have a single value. To allow
reversible operation in the several-to-one mapping case
the computer must record this information in its memory.
The genetic logic system has no such memory. Second,
irreversibility of information flow in the genetic logic
system is the result of the several-to-one mapping of
error detecting and error correcting codes. The genetic
code has this property, which results in the source
alphabet having a larger entropy than the receiving
alphabet. No code exists such that messages can be sent
from the receiver alphabet B to the source in alphabet
A if the alphabet A has a larger entropy. Information
cannot be transmitted from protein to DNA or mRNA for
this reason. On the other hand, the information capacity
of DNA may be used more fully by recording genetic messages
in two or even three reading frames. Thus the phenomenon
of the overlapping genes is related mathematically to
the Central Dogma and the redundance of the genetic code.
======================================================

Brian Harper
Associate Professor
Applied Mechanics
Ohio State University