Re: Information: Brad's reply (was Information: a very

Greg Billock (billgr@cco.caltech.edu)
Mon, 29 Jun 1998 14:02:15 -0700 (PDT)

Glenn Morton:

[...]

> I have a question. I don't want to miss something here. It seems to me
> that the code we need to deal with is the DNA code. We know how many
> characters are in the DNA language, 4, and given any length of DNA, we can
> calculate exactly how many permutations there are. We can't tell how many
> of those permutations would create a living being, of course (and I agree
> with you that the percentage of all possible sequences that would create a
> living being is quite small indeed), but we do know precisely how many
> permutations there are in any given sequence. So, what exactly do we not
> know which prevents us from using H = -k p[I]log(p[I]) to calculate the
> informational content of a given string of DNA? We know k, we know the
> p[I]'s and we understand logarithms, and that is all we need to know,
> isn't it?

Not really. Like Brad was saying, we can squeeze text a lot more if we
know it is English than if we don't. That is, calculating the entropy
of an English sentence isn't as easy as taking letter frequency tables
and adding logs, because the letters are not independent of one another.
This is the same (only more so) for the DNA "language."
We can be confident that there are tight constraints on DNA sequences
found in organisms (speaking about coding DNA here, of course, as we
have been), but we're not sure what they are. What this means is that
a DNA strand doesn't give us as much information as the
-SUM(p log(p)) measure over single-nucleotide frequencies would suggest.
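
To make the point concrete, here is a minimal sketch in Python. It is only
an illustration under stated assumptions: the toy sequence, the function
names, and the choice of a one-symbol context are all made up for this
example, and real DNA constraints are certainly not this simple. It compares
the per-symbol entropy from a frequency table with the entropy once you
know the previous symbol.

```python
from collections import Counter
from math import log2

def zeroth_order_entropy(seq):
    """Bits per symbol from single-symbol frequencies: -SUM(p log2 p)."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def first_order_entropy(seq):
    """Bits per symbol when the previous symbol is already known."""
    pair_counts = Counter(zip(seq, seq[1:]))   # counts of (previous, current)
    prev_counts = Counter(seq[:-1])            # counts of each previous symbol
    n_pairs = len(seq) - 1
    h = 0.0
    for (prev, cur), c in pair_counts.items():
        p_pair = c / n_pairs                   # P(previous, current)
        p_cond = c / prev_counts[prev]         # P(current | previous)
        h -= p_pair * log2(p_cond)
    return h

# Hypothetical, heavily constrained "genome": all four letters are equally
# frequent, but each letter is fully determined by the one before it.
toy = "ACGT" * 250

h0 = zeroth_order_entropy(toy)   # what the frequency table alone suggests
h1 = first_order_entropy(toy)    # what is left once the constraint is known
print(h0, h1)
```

The frequency table says every symbol is worth 2 bits, yet once the
constraint is known each new symbol tells you nothing. With real sequences
the constraints are unknown, so the frequency-table figure is only an upper
bound.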


> >Nobody knows what that fraction is, or even where exactly the most
> >crucial parts are. So similar to the "0" message, it is a hard
> >problem to try to figure out what information theory might do for us
> >in biology.
>
> The only way I see that the '0' message applies to DNA is if you are
> suggesting that there are other nucleotides of which we are unaware. As
> you suggested with the '0' message, "Since there is basically no way to
> detect what a one-shot information source could have done, there is
> no way to model the distribution of its possible messages, and
> consequently no way to figure out how much information was gotten. Was I
> restricting myself to {0, 1}? to numerals? to one-digit
> keyboard taps? with some weird probability distribution on them?
> unrestricted length?"
>
> But with DNA, we have only 4 nucleotides and thus KNOW what we are
> restricted to. This should make the information content calculable. What
> am I missing here?

The other 2,999,999,999 nucleotides in the sequence. :-) Remember, the
information content of a message is just how many bits of information it
gives you that you didn't already know. Since you know that
(AAAA .... (2,999,999,990 'A's) ..AAA) is not a real genome of a real
organism, knowing the *actual* sequence gives you a bit less information
than if it were. Supposing there were only 2^1,000,000 possible
organism genomes instead of 2^6,000,000,000, then knowing the actual
genome would give you 1,000,000 bits rather than 6,000,000,000 bits:
nearly four orders of magnitude less information than you would get
using the simplest analysis.
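
The arithmetic can be checked with a few lines of Python. This is just a
sketch: the genome length of 3,000,000,000 nucleotides and the figure of
2^1,000,000 possible genomes are the supposed numbers from the paragraph
above, not measured values, and the variable names are made up for the
example. It assumes all possible genomes are equally likely, so the
information is simply log2 of the number of possibilities.

```python
from math import log10

GENOME_LENGTH = 3_000_000_000   # nucleotides, as supposed above
BITS_PER_NUCLEOTIDE = 2         # log2(4) for an unconstrained 4-letter alphabet

# Simplest analysis: every one of the 4^N sequences is a possible genome.
naive_bits = GENOME_LENGTH * BITS_PER_NUCLEOTIDE

# Hypothetical constraint: only 2^1,000,000 sequences are possible genomes,
# so learning which one you have is worth log2(2^1,000,000) bits.
constrained_bits = 1_000_000

ratio = naive_bits / constrained_bits
orders_of_magnitude = log10(ratio)
print(naive_bits, constrained_bits, ratio, round(orders_of_magnitude, 2))
```

The ratio comes out to 6000, a little under four orders of magnitude.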

-Greg