Re: pure chance

Brian D. Harper (harper.10@osu.edu)
Sat, 11 Jan 1997 21:50:23 -0500

At 01:26 PM 1/10/97 -0800, Greg wrote:

[...]

>
>Hmmm. This seems like an interesting line of inquiry.
>The numbers you gave below correspond to a loss of about 1.5
>bits for each codon->amino acid (de)coding operation. I'm
>not sure I understand what you meant in the last part here. Are
>you saying that this 1.5 bits is like a 'threshold' that makes
>information incapable of moving in reverse (from proteins to DNA)?
>

Let me try to answer using an analogy Yockey likes to use.
It's easier to deal with than the genetic code since we can
calculate all the probabilities exactly. Suppose we toss a pair
of fair dice and consider an alphabet consisting of all the
possible ordered pairs. By ordered pairs I mean that the event
(3,4) is different from (4,3).

Now, this alphabet contains 36 letters and, since the dice are
fair, all letters occur with equal probability. We'll refer to this
alphabet as A with elements x (the 36 letters). The Shannon
entropy for this case is:

H(x) = - SUM|i=1..36|[ (1/36)*Log2(1/36) ] = Log2(36) = 5.17 bits

Now, our code is that as the dice are cast, we record only
their sum. The sums are recorded in a new alphabet B
with eleven elements y [2,3,4,5,6,7,8,9,10,11,12]. However,
these elements are not all equally probable, since there
is only one way to get a 2 or a 12 but six ways to get a 7.
Since we can easily compute the probability of each event
(i.e. 1/36 for a 2 or a 12; 1/6 for a 7, etc.) it's no trouble to
calculate the Shannon entropy

H(y) = - SUM|j=1..11|[ p(yj)*Log2(p(yj)) ] = 3.27 bits

where p(yj) are the 11 probabilities for the 11 elements.
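
If you want to check these numbers, here is a little sketch
(Python, purely for illustration) that computes both entropies
straight from the definitions above:

  from math import log2
  from collections import Counter

  # Alphabet A: the 36 equally probable ordered pairs (i, j).
  pairs = [(i, j) for i in range(1, 7) for j in range(1, 7)]
  H_x = -sum((1 / 36) * log2(1 / 36) for _ in pairs)
  print(round(H_x, 2))   # 5.17 bits, i.e. Log2(36)

  # Alphabet B: the eleven sums 2..12, with unequal probabilities.
  counts = Counter(i + j for i, j in pairs)   # ways to roll each sum
  H_y = -sum((n / 36) * log2(n / 36) for n in counts.values())
  print(round(H_y, 2))   # 3.27 bits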

This is a simple analogy, but it has a lot of features in common
with DNA --> protein. Here we can easily see how the information
was lost: when we receive the letter 7, we have no idea which of
the six ordered pairs was actually thrown. On average, each coding
operation destroys H(x) - H(y) = 5.17 - 3.27 = 1.90 bits. We also
see why we cannot reverse the process: knowing only that the sum
is 7, we have no way of deciding which of the six ordered pairs
to assign.
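
A couple more lines in the same sketch make the ambiguity concrete:

  # Knowing only that the sum is 7, the best we can do is list the
  # preimages; all six are equally likely.
  preimages = [(i, j) for i in range(1, 7)
               for j in range(1, 7) if i + j == 7]
  print(preimages)  # [(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)]
  # Picking among six equally likely candidates is Log2(6) = 2.58
  # bits of missing information for this particular letter.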

Greg:===
>If I were making a guess (and that's what it would have to be), I
>would say the reason for unidirectionality is the protein folding
>problem--proteins are made linearly, then fold up, and they are
>stable folded up, and so can't unfold, and there aren't any enzymes
>to do reverse coding from the tertiary structure (I'm not sure if there
>are any from the primary structure or not...). Since all the DNA-relevant
>information (or most of it) is hidden inside the scrunched-up protein,
>the DNA information that went in can never be gotten out. How is
>this guess related to the information direction? Can it be rephrased
>in information-theoretic terms (less information in the surface topography
>of a protein than in its internal structure, or something?), or does
>info theory help us here?
>

My suspicion is that what you are giving is a physical reason
why reversibility would not occur even if it were "possible"
according to info-theory.

In his book, Yockey mentions a case where the information
flow is reversible. I don't know enough about mol. biol. to
make much sense of it but perhaps you can:

"In the retroviral case, where the two alphabets have the
same entropy, information may flow in either direction
in the DNA-mRNA encoding if the process is catalyzed
by a reverse transcriptase."

[...]

BH:==
>> Sorry about the confusion. I personally have learned a
>> great deal from this exercise. Let's continue on
>> with these ideas to see some further useful insights
>> from info-theory.
>
Greg:==
>Uh, oh, now I'm partially mad. :-) :-) I thought I had
>persuaded you that the assumption of equal probabilities
>for all the codons was a poor estimate of the ensemble (or
>at least *could* be a poor estimate, and so needed some
>justification). Is this what you are going back on?
>

Wow, I guess I must have really miscommunicated somehow.
When I computed the maximum entropy for DNA assuming
equal probabilities, that was just a rough estimate done
purely for convenience, since I didn't want to enter the
61 probabilities given in the paper. What the authors did
was try to estimate the probabilities based on the information
available at that time, but they definitely were not equal.
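
For what it's worth, the reason equal probabilities make a
convenient shortcut is that the uniform distribution maximizes
Shannon entropy over a fixed alphabet, so Log2(61) = 5.93 bits is
an upper bound. A quick sketch (the skewed weights below are
invented for illustration; they are not the 61 probabilities from
the paper):

  from math import log2

  # Upper bound: 61 equally probable sense codons.
  print(round(log2(61), 2))   # 5.93 bits

  # Any unequal distribution over the same 61 codons comes out
  # lower. These weights are made up purely for illustration.
  weights = [10] * 10 + [2] * 30 + [1] * 21
  total = sum(weights)
  H = -sum((w / total) * log2(w / total) for w in weights)
  print(round(H, 2))          # about 5.33 bits, strictly below 5.93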

[...]

>BH:===
>> First of all, the redundancy in the genetic code that
>> results from the source having a larger alphabet than
>> the receiver is often referred to in a pejorative sense
>> as degeneracy. First, we see from the above that this
>> is necessary to guarantee a unidirectional transfer of
>> information. Yockey also points out that redundancy is
>> crucial for error correcting.
>
Greg:==
>It is also a kind of ECC--that is, degeneracy like this
>is exactly what ECCs rely on to decrease probability of
>error. You make the codewords in clusters so that their
>Hamming distance is larger than if you used densely packed
>codewords. I think it would be interesting to see if the
>genetic code were optimal in some sense in this way. It
>should be fairly easy to figure out--do you know if anyone
>has done so?
>

Actually, Yockey does state that the genetic code was shown
to be optimal in a paper published in 1985.
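
I don't have that reference handy, but the calculation you describe
is easy to set up. Here's a toy sketch using just a few rows of the
standard code table (deliberately partial); it counts how many
single-base changes to a codon leave the amino acid unchanged,
which is the degeneracy-as-error-correction idea in miniature:

  # A deliberately partial slice of the standard genetic code,
  # written as DNA codons.
  code = {}
  for c in ("TTA", "TTG", "CTT", "CTC", "CTA", "CTG"):
      code[c] = "Leu"
  for c in ("TCT", "TCC", "TCA", "TCG", "AGT", "AGC"):
      code[c] = "Ser"
  for c in ("CCT", "CCC", "CCA", "CCG"):
      code[c] = "Pro"

  def synonymous_neighbors(codon):
      # Count Hamming-distance-1 mutants coding the same amino
      # acid (checked only against our partial table).
      hits = 0
      for pos in range(3):
          for base in "ACGT":
              if base != codon[pos]:
                  mutant = codon[:pos] + base + codon[pos + 1:]
                  if code.get(mutant) == code[codon]:
                      hits += 1
      return hits

  for codon in ("CTC", "TCA", "CCG"):
      print(codon, code[codon], synonymous_neighbors(codon))
  # Each prints 3: every third-position change is synonymous, so
  # a third of all point mutations to these codons are invisible
  # at the protein level.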

>Yockey:
>[...]
>> code has this property, which results in the source
>> alphabet having a larger entropy than the receiving
>> alphabet. No code exists such that messages can be sent
>> from the receiver alphabet B to the source in alphabet
>> A if the alphabet A has a larger entropy. Information
>> cannot be transmitted from protein to DNA or mRNA for
>> this reason. On the other hand, the information capacity
>> of DNA may be used more fully by recording genetic messages
>> in two or even three reading frames. Thus the phenomenon
>> of the overlapping genes is related mathematically to
>> the Central Dogma and the redundance of the genetic code.
>> ======================================================
>
Greg:==
>I think I see what Yockey is getting at here, but the difference
>in max entropies is not a sufficient reason for the Central
>Dogma. One can translate back and forth between two codes even
>of very different entropy per symbol, and the process quickly
>reaches a limit cycle. For example, suppose my code was that
>vowels go to '1' and consonants go to '2'. Translating from
>English I get
>
>compressthissentence -> 21222122221221221221
>
>Obviously, I can't recover the original information in the
>sentence, but for now I don't care--the functional information I'll
>assume is in the 1s and 2s. So now I do a reverse translation, using
>'a' for 1 and 'b' for 2: babbbabbbbabbabbabba.
>
>Further codings and decodings will just map between the 'a' and 'b'
>string and the '1' and '2' string.
>

But you are really not translating back and forth between two
alphabets with different entropies. When you say "Obviously, I
can't recover the original information in the sentence" you concede
that the process is irreversible. Whether or not you care is
irrelevant ;-). When you reach your final state you are translating
reversibly back and forth, but between two alphabets with the
same number of characters, (1,2) and (a,b), and (since the symbol
frequencies carry over unchanged) the same entropy.
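
Running your example mechanically shows where it settles (a
sketch; the two mappings are just your stated rules):

  def to_digits(s):
      # Greg's forward code: vowels -> '1', consonants -> '2'.
      return "".join("1" if ch in "aeiou" else "2" for ch in s)

  def to_letters(s):
      # The reverse translation: '1' -> 'a', '2' -> 'b'.
      return "".join("a" if ch == "1" else "b" for ch in s)

  s = to_digits("compressthissentence")
  print(s)              # 21222122221221221221
  print(to_letters(s))  # babbbabbbbabbabbabba

  # From here on the strings simply alternate: a limit cycle
  # between two alphabets of the same size and, since the symbol
  # frequencies match, the same entropy per symbol.
  print(to_digits(to_letters(s)) == s)   # True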

Greg:==
>So the fact that DNA has more information than protein is *evidence*
>that the Central Dogma is true; it is not *the reason that* the Central
>Dogma is true. This is especially the case in the DNA/protein system,
>where the fact that DNA has higher information capacity is what I
>understand drives neutralism. The fact that most mutations either have
>no effect at all upon protein, or, if they do have an effect, that
>effect is irrelevant to the function of the proteins, means that the
>DNA can move towards a condition of maximum entropy within the constraints
>of the proteins needed for duplication. This is the basis for all the
>'molecular clock' studies. I don't think Yockey's statement here is
>correct, at least as far as I understand it. It is true in a sense,
>but it is a sense that makes no difference to biology.
>

I really appreciate the above comments. One thing I'm sorely lacking
is an understanding of molecular biology. I'm glad there's someone
around who knows stuff like this and is willing to discuss it. Yockey
does go into some of the stuff you mention above, for example molecular
clocks and molecular phylogenies. It would be nice if you could get
hold of a copy of his book so that I don't have to spend so much time
translating. Or if you can't find that, maybe you have access to the
Journal of Theoretical Biology?

Brian Harper
Associate Professor
Applied Mechanics
Ohio State University