Re: pure chance

Brian D. Harper (harper.10@osu.edu)
Sat, 04 Jan 1997 00:25:40 -0500

At 10:39 AM 1/3/97 -0800, Greg wrote:
>Brian Harper:
>
>
>> Hopefully you'll read my reply to Gene as I think your main concern
>> is that you're expecting information theory to address "information"
>> in the more usual sense of the word. Remember Yockey's illustration
>> about being unable to send Manfred Eigen a package labeled "gift"
>> since gift means poison in German. In your example *uestion, you
>> realize q is appropriate because you have English words in mind.
>> If you showed the same thing to someone knowing only Chinese
>> they would have no clue about q. The clue comes from an
>> understanding of how letters are arranged; that understanding
>> is not contained in the letters themselves.
>
>I'm open to the idea that I didn't explain what I was thinking clearly
>enough, or, even (!), that I was a bit muddled. In my reply to Glenn,
>I hope I was more clear about the role the *ensemble* has in information--
>namely, it is all there is.

Yes, it does seem that I misunderstood you to a large degree; sorry
about that. You are absolutely right that the ensemble is
very important. Another useful interpretation of information
content a la Shannon is in terms of uncertainty. There is an
ensemble of possible messages, and until a message is received
there is uncertainty as to which of the possible messages has
been sent. There is thus an obvious correlation between the
size of the ensemble and the information. If the ensemble
contains only one message, then no uncertainty is removed
at the receiver and thus no information is conveyed.
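
To make the uncertainty interpretation concrete, here is a minimal
sketch in Python. The ensembles and probabilities are invented purely
for illustration; this is not anyone's published calculation.

import math

def entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Four equally likely messages: receiving one removes 2 bits
# of uncertainty.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0

# A one-message ensemble: nothing is uncertain, so receiving
# the message conveys no information.
print(entropy([1.0]))  # 0.0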

Greg:==
> From what I could tell in the paper you
>summarized, the authors treated the DNA as a string of codons. All
>well and good, but why codons? Why didn't they treat the bases individually?
>Why didn't they treat groups of 100 codons? Why didn't they treat the
>entire chromosome as one giant exemplar of a huge ensemble of all possible
>chromosome sequences, all with their associated probabilities? If, indeed,
>DNA has no long-range correlations, so that treating its codons as more-or-
>less independent makes sense, then the authors' assumption that the ensemble
>for the DNA strand is a joint ensemble between all the constituent codons
>is a good one. If DNA *does* have long-range correlations, then it is a
>bad one.
>

Well, I'm not quite sure what to say about this. My layman's understanding
is that codons were selected because they are the most natural
choice for an "alphabet" from which all genetic messages would be
constructed. Also, it is specific codons that correspond to amino acids
as the genetic message is decoded. But there is no reason why one
could not construct other "alphabets" corresponding to groups of codons,
as you suggest. This would give an entirely different probability space
and a different entropy. How useful this would be is another question that
I really can't address. The language analogy would be choosing an
"alphabet" whose symbols are the words of some language. This would,
of course, be more like a dictionary than an alphabet.
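
Just to illustrate how the choice of "alphabet" changes the probability
space and the entropy, here is a rough sketch. The strand is made up,
and block counting on so short a string is only a toy example.

import math
from collections import Counter

def block_entropy(sequence, block):
    """Entropy in bits per block over non-overlapping blocks."""
    blocks = [sequence[i:i + block]
              for i in range(0, len(sequence) - block + 1, block)]
    counts = Counter(blocks)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in counts.values())

dna = "TACGGTGACTGACACTG"          # invented strand
print(block_entropy(dna, 1))       # alphabet of single bases
print(block_entropy(dna, 3))       # alphabet of codons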

Come to think of it, I do recall an instance where something like this
was done, in a paper which stirred up a lot of controversy. The title
was something like "Hints of a language in junk DNA". If memory serves,
the authors looked at the various possible "words" of various lengths
(i.e., lengths measured in codons). They then ranked the words of
each length by their frequency of appearance in the junk DNA. Plotting
rank against frequency on a double-log scale yielded a straight line,
i.e. Zipf's law, a property associated with all natural languages.

As I said, this work caused quite a stir, but if memory serves it was
subsequently discredited.
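
For what it's worth, the rank-frequency test behind Zipf's law is
easy to sketch. This is only a toy version of what I remember the
authors doing, on an invented sequence and a single word length.

import math
from collections import Counter

def zipf_points(sequence, word_length):
    """(log rank, log frequency) pairs for overlapping words."""
    words = [sequence[i:i + word_length]
             for i in range(len(sequence) - word_length + 1)]
    frequencies = sorted(Counter(words).values(), reverse=True)
    return [(math.log(rank), math.log(freq))
            for rank, freq in enumerate(frequencies, start=1)]

# Zipf's law predicts these points fall roughly on a straight line
# when the sequence has language-like statistics.
points = zipf_points("TACGGTGACTGACACTGTACGGTGAC", 3)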

Greg:===
>Here's yet another (possibly bad) example. Suppose we have a string of
>bases, with C always following A. So:
>
>TACGGTGACTGACACTG
>
>and so forth. In this case no information is conveyed by writing down the
>'C's that follow the 'A's, because the ensemble of possibilities only has
>one element. If we are going to mutate a C, though, then we are obviously
>going to have a strand with *more* information than the original, since once
>we know what the mutation was *to*, we'll have resolved an ensemble of
>{A,G,C,T}. This is completely independent of any frequency of occurrence
>of C with respect to any other base. (Use codons to break down the argument
>of the authors you quoted.) What makes it work is the 'long' range order
>(two bases in this case). This is an extreme case--correlations can make
>intermediate results, and order more long-range than a couple of bases may
>come into play, and have to be played off against shorter-range order. Really
>messy.
>

This is a great example. Yes, this throws all of the authors' conclusions
into doubt as far as I'm concerned. Actually, it is still possible to do
what the authors proposed (correctly this time :) but it would take a
lot more effort.
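
Your example can be put in terms of conditional uncertainty. A minimal
sketch, using your C-always-follows-A rule and invented probabilities:

import math

def entropy(probabilities):
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Unconditionally, the next base is drawn from {A, C, G, T}.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2 bits of uncertainty

# Given that the previous base was A, the ensemble for the next
# base has the single element C: no uncertainty, no information.
print(entropy([1.0]))  # 0 bits

# Mutating that C restores the full four-element ensemble, so the
# mutated strand resolves more uncertainty than the original.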

The criticism you raise is exactly the criticism Yockey
levels at many published studies. It turns out to be a common
mistake, so I'm kicking myself that I didn't recognize it.

Yockey mentions exactly the type of situation you are discussing above:

"The third position in eight of the familiar triplet codons in the
set space G^3 may vary indefinitely among the four nucleotides
in either DNA or RNA alphabets without changing the read off
amino acid. The specificity of these codons is defined by the first
two nucleotides"

With respect to the authors' conclusions, it is obvious, I think, that a
mutation in the third position of one of these eight codons would not
change the information content, regardless of the frequencies of
the codons involved.

As I mentioned above, it is possible for information theory to take
this type of situation into account, and Yockey gives some
examples of how to do it in his book. But the correlations
that can be accounted for in this way must be statistical in nature;
information theory still cannot address the meaning or value of
the message. For example, in the case Yockey mentions above,
one has a set of constraints imposed on the possible messages
in the ensemble. These constraints reduce the uncertainty and
thus the information content.
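
As a rough illustration of how such a constraint shrinks the ensemble
and the uncertainty (the uniform probabilities are, again, just for
the sake of the sketch):

import math

def entropy(probabilities):
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Unconstrained: 64 equally likely codons.
print(entropy([1.0 / 64] * 64))  # 6 bits

# Constrain the first two positions of one of Yockey's eightfold-
# degenerate codons; only the third position is free to vary.
print(entropy([1.0 / 4] * 4))    # 2 bits

# A mutation in that free third position just moves the message
# within the same four-element ensemble, so the specificity (the
# amino acid read off) is unchanged.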

Brian Harper
Associate Professor
Applied Mechanics
Ohio State University