Re: pure chance

billgr@cco.caltech.edu
Fri, 3 Jan 1997 10:39:04 -0800 (PST)

Brian Harper:


> Hopefully you'll read my reply to Gene as I think your main concern
> is that you're expecting information theory to address "information"
> in the more usual sense of the word. Remember Yockey's illustration
> about being unable to send Manfred Eigen a package labeled "gift"
> since gift means poison in German. In your example *uestion, you
> realize q is appropriate because you have English words in mind.
> If you showed the same thing to someone knowing only Chinese
> they would have no clue about q. The clue comes from an
> understanding about arrangements of letters, this understanding
> is not contained in the letters themselves.

I'm open to the idea that I didn't explain what I was thinking clearly
enough, or, even (!), that I was a bit muddled. In my reply to Glenn,
I hope I was more clear about the role the *ensemble* has in information--
namely, it is all there is. From what I could tell in the paper you
summarized, the authors treated the DNA as a string of codons. All
well and good, but why codons? Why didn't they treat the bases individually?
Why didn't they treat groups of 100 codons? Why didn't they treat the
entire chromosome as one giant exemplar of a huge ensemble of all possible
chromosome sequences, all with their associated probabilities? If, indeed,
DNA has no long-range correlations, so that treating its codons as more-or-
less independent makes sense, then the authors' assumption that the ensemble
for the DNA strand is a joint ensemble between all the constituent codons
is a good one. If DNA *does* have long-range correlations, then it is a
bad one.
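
One rough way to see how much the choice of ensemble matters is to
estimate the entropy per base with different block sizes. The sketch
below is only an illustration: entropy_per_base is a made-up helper,
the toy strand is invented, and plug-in estimates like this are biased
for short sequences. If the units really are independent, the per-base
figure should not change much with block size; if there are
correlations, it drops once the blocks are long enough to capture them.

```python
from collections import Counter
from math import log2

def entropy_per_base(seq, k):
    """Plug-in entropy estimate (bits per base) using non-overlapping blocks of k bases."""
    blocks = [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]
    counts = Counter(blocks)
    total = len(blocks)
    # Shannon entropy of the observed block frequencies
    h = sum(-(n / total) * log2(n / total) for n in counts.values())
    return h / k

# Toy strand with perfect two-base order (hypothetical, not real DNA)
toy = "AC" * 50
single_base = entropy_per_base(toy, 1)  # bases treated one at a time
pairs = entropy_per_base(toy, 2)        # bases treated as pairs
```

Treated base by base, the toy strand looks like 1 bit per base; treated
in pairs, it carries nothing, because "AC" is the only pair that occurs.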

Here's yet another (possibly bad) example. Suppose we have a string of
bases, with C always following A. So:

TACGGTGACTGACACTG

and so forth. In this case no information is conveyed by writing down the
'C's that follow the 'A's, because the ensemble of possibilities only has
one element. If we are going to mutate a C, though, then we are obviously
going to have a strand with *more* information than the original, since once
we know what the mutation was *to*, we'll have resolved an ensemble of
{A,G,C,T}. This is completely independent of any frequency of occurrence
of C with respect to any other base. (Use codons to break down the argument
of the authors you quoted.) What makes it work is the 'long' range order
(two bases in this case). This is an extreme case--weaker correlations give
intermediate results, and order longer-range than a couple of bases may
come into play and have to be played off against shorter-range order. Really
messy.
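
The example can be checked by counting which base follows which in the
strand above and computing the entropy of each "next base" ensemble.
This is a minimal sketch: next_base_entropy is a hypothetical helper,
the frequencies come from this one short strand only, and the 2-bit
figure for the mutated site assumes the four outcomes are equally
likely.

```python
from collections import Counter, defaultdict
from math import log2

def next_base_entropy(seq):
    """Map each base to the entropy (bits) of the base observed right after it."""
    followers = defaultdict(Counter)
    for prev, nxt in zip(seq, seq[1:]):
        followers[prev][nxt] += 1
    result = {}
    for prev, counts in followers.items():
        total = sum(counts.values())
        result[prev] = sum(-(n / total) * log2(n / total) for n in counts.values())
    return result

strand = "TACGGTGACTGACACTG"
entropy_after = next_base_entropy(strand)

# After an 'A' the only observed follower is 'C', so that ensemble has
# one element and writing the 'C' down conveys no information.
bits_after_A = entropy_after["A"]

# Assumption: a mutated site is equally likely to be any of {A,G,C,T}.
bits_after_mutation = log2(4)
```

The site after an 'A' goes from 0 bits to 2 bits once mutation is
allowed there, regardless of how often C occurs elsewhere in the strand.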

> functionality of a protein or DNA sequence then I think you
> will not consider it of much use. Functionality would be analogous
> to the meaning of the word "gift" in the illustration above. But
> one doesn't need info-theory for this. There are other methods
> available for determining functionality.

The functionality question certainly gets involved here. If we
were knowledgeable enough, we could write down a prior probability
distribution for a critter knowing, for example, that it flies, is
10 cm long, and eats bugs. Then it is meaningful to speak about the
amount of information conveyed by its genome. Given our current
state of knowledge, I think that our priors are pretty uniform, so
we pretty much get length * Log(4) :-).
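
Taking a uniform prior to mean that every one of the 4^length possible
sequences is equally likely, the information in one particular strand
is log2(4^length) = length * log2(4), i.e. 2 bits per base. A tiny
sketch (uniform_prior_bits is a made-up helper name, and the 17-base
length is just the toy strand from earlier):

```python
from math import log2

def uniform_prior_bits(length, alphabet_size=4):
    """Bits needed to single out one sequence when all are equally likely."""
    return length * log2(alphabet_size)

bits_for_toy_strand = uniform_prior_bits(17)
```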

-Greg