Re: Information: Brad's reply (was Information: a very

Greg Billock (billgr@cco.caltech.edu)
Tue, 30 Jun 1998 10:20:44 -0700 (PDT)

Glenn,

[...]

> >> But with DNA, we have only 4 nucleotides and thus KNOW what we are
> >> restricted to. This should make the information content calculable. What
> >> am I missing here?
> >
> >The other 2,999,999,999 nucleotides in the sequence. :-)
>
> I think I must have miscommunicated here. The information content of a
> sequence is related to the ENTIRE sequence, all 2,999,999 of them. There
> are only 4 letters in the DNA alphabet. That was what I was meaning with
> the 4 nucleotides. I was NOT referring to a 4 nucleotide long DNA
> molecule. Go back and re read what I said in that light.
>
> I still don't think we need to know how many of the possible 3 billion long
> DNA chains yield life to calculate information.

I see what you mean, but here's the paragraph from Shannon again:

    We can think of a discrete source as generating the message, symbol
    by symbol. It will choose successive symbols according to certain
    probabilities depending, in general, on preceding choices as well
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    as the particular symbols in question. A physical system, or a
    mathematical model of a system which produces such a sequence of
    symbols governed by a set of probabilities, is known as a stochastic
    process.

When the source has dependencies like this (i.e., the symbols are not
all independent of one another), calculating the information as
2*length (2 bits per base) overestimates it. That figure is an upper
bound, reached only when every base is independent of the others and
all four are equally likely. The genome has long-range
interdependencies, so its actual information content is lower.
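
To make the point concrete, here is a minimal sketch in Python. It
compares the 2 bits/base figure for independent, equally likely bases
with the entropy rate of a toy first-order Markov source. The
transition probabilities (STAY = 0.7) are made up for illustration and
are not measured from any genome; real dependencies reach much further
than one base back, so this only shows the direction of the effect.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

BASES = "ACGT"
N = len(BASES)

# Case 1: every base independent and equally likely.
h_independent = entropy_bits([1.0 / N] * N)   # 2.0 bits per base

# Case 2: hypothetical first-order Markov source (an assumption, not data).
# Each base repeats the previous one with probability STAY; otherwise it
# is one of the three other bases with equal probability.
STAY = 0.7
transition = [
    [STAY if i == j else (1.0 - STAY) / (N - 1) for j in range(N)]
    for i in range(N)
]

# This transition matrix is symmetric, so the stationary distribution
# is uniform over the four bases.
stationary = [1.0 / N] * N

# Entropy rate of a stationary Markov chain: the average, over the
# stationary distribution, of the entropy of each row.
h_markov = sum(pi * entropy_bits(row) for pi, row in zip(stationary, transition))

def total_bits(bits_per_base, length):
    """Information in a sequence of the given length at a given entropy rate."""
    return bits_per_base * length

if __name__ == "__main__":
    length = 3_000_000_000
    print("independent bases:", h_independent, "bits/base,",
          total_bits(h_independent, length), "bits total")
    print("toy Markov source:", round(h_markov, 4), "bits/base,",
          round(total_bits(h_markov, length)), "bits total")
```

Even this one-step dependence pulls the rate below 2 bits per base, so
2*length is a ceiling and not the answer.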

-Greg