Re: Information: Brad's reply (was Information: a very

Brad Jones (bjones@tartarus.uwa.edu.au)
Sat, 27 Jun 1998 17:06:50 +0800

>>------------------------------------------------------------
>>Glenn,
>>
>>I think there are two major problems with your reasoning.
>>
>>1. You are assuming that the information content of DNA is best
>>modeled as an information source. As I will show below, this is not
>>the case, and by doing so you will not produce any informative results.
>>
>>2. Your analysis of DNA as an information source is also incorrect
>>and your results are misleading.
>>
>>The main mistake that you made was to assume that DNA is a "zero
>>memory" source. A zero memory source outputs symbols that do not
>>depend on the previous symbols; this is not the case with DNA.
>>
>>A DNA sequence of AAAAATAAAA will output this each and every
>>time eg: AAAAATAAAA AAAAATAAAA AAAAATAAAA
>>AAAAATAAAA
>>
>>A zero memory source with the probabilities given would produce
>>something like: AATAAAAAAA AAAAATAAAA AAAAAAATAA
>>ATAAATAAAA
>
>
>I believe that you are mixing the way memory works in such systems. Zero
>memory usually applies to a Markov chain which doesn't use the previous
>character to determine the next.

A Markov source is defined as a source where the next symbol depends on one
or more of the previous symbols. A zero-memory source is the opposite of
this. It is therefore impossible, by definition, to have a Markov source
where the next symbol does not depend on the previous ones.

> It is not the entire sequence that
>'memory' refers to. In English vowels are more likely to appear after a
>consonant. English is not a zero memory system since the previous
>character influences the next character.

'Memory' refers to dependence on previous symbols, whether one or several;
the number of previous symbols involved determines the order of the Markov
source.
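
For anyone who wants to see the difference concretely, here is a rough
sketch (in Python, purely illustrative; the probabilities are made up for
the example) of a zero-memory source next to a first-order Markov source
over the alphabet {A, T}:

import random

random.seed(1)

def zero_memory(n, p_a=0.9):
    # each symbol is drawn independently of everything emitted before
    return "".join("A" if random.random() < p_a else "T" for _ in range(n))

def first_order_markov(n):
    # the next symbol depends on the single previous symbol (order 1);
    # in this illustration a T is always followed by an A
    trans = {"A": 0.9, "T": 1.0}   # P(next symbol = A | previous symbol)
    out = ["A"]
    for _ in range(n - 1):
        out.append("A" if random.random() < trans[out[-1]] else "T")
    return "".join(out)

print(zero_memory(20))          # T can appear anywhere, even twice in a row
print(first_order_markov(20))   # TT can never occur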

I am not sure what the reference to English achieves; it is well known that
English is not a zero-memory source. An analysis of English (much the same
as I did for DNA) can be found in Abramson, N., "Information Theory and
Coding", McGraw-Hill, 1963.

In that analysis the English _words_ were used as a better model of the
source than the individual letters. This is exactly what I did with DNA,
and it is a valid information-theory technique. The same text also shows
the process of refining the model with Markov source models and the
resulting drop in entropy.

>q is always followed by u in
>English so when q appears, the choice of the next character is entirely
>constrained to u.

>>If you would like this mathematically then the analysis is as follows:
>>
>>*note that log to base 2 is used to give the result in bits*
>>
>>sequence: AAAAATAAAA AAAAATAAAA AAAAATAAAA
>>AAAAATAAAA
>>zero memory:
>>P(A) = 0.9, P(T) = 0.1
>>H(S)= sum( p(i)*log(1/p(i)) )
>> =0.47 bits
>>
>>2nd extension:
>>using groups of two: AA AA AT AA AA ...
>>P(AA) = 4/5 P(AT) = 1/5
>>H(S^2) = 0.72
>>H(S) = 0.36 bits
>>
>>The 2nd extension has a lower information content than the zero
>>memory, this indicates that the source is not a zero memory source.
>>
>>You can extend this by taking the 3rd, 4th etc and the information
>>content will continue to drop. eventually the information rate will
>>drop to zero as the 10th extension shows:
>>
>>AAAAATAAAA AAAAATAAAA AAAAATAAAA
>>AAAAATAAAA
>>P(AAAAATAAAA) = 1
>>H(S) = 0
>>
>>
>>
>You are using a 10th degree Markov chain and that is not what DNA is. Brian
>would you care to comment on this?

I just proved that it is at least 10th order, and you claim it isn't, with
no supporting evidence (let alone mathematical analysis)?!

As a side note, the 10th extension is not the same as a 10th-order Markov
source. The 10th-order Markov calculations are significantly more complex
and would be too difficult to present in text format.

What the 10th extension gives is a better approximation of the true entropy
of the source than the 1st extension. A 10th-order Markov model would give
the actual entropy of the source. In this case the two are identical
because of the properties of the source: H(S) = 0.

A law of information theory states that if a source is a zero-memory
source, then every nth extension gives the same per-symbol entropy, i.e.
H(S^n) = n*H(S) for all n.

Put simply: the higher the order of the model used, the better (lower) the
entropy estimate. The limit that higher-order models approach is the actual
entropy of the source, so once the entropy stops changing as the order is
increased, the true source entropy has been obtained.

An example of this is given below: for a true zero-memory source, taking an
extension produces no reduction in the per-symbol entropy (unlike my
previous analysis of the DNA sequence).

zero-memory source: P(0) = 0.25, P(1) = 0.75
therefore H(S) = 0.81 bits

2nd extension of the zero-memory source:
P(00) = P(0)P(0) = 0.06   *from independence*
P(01) = 0.19
P(10) = 0.19
P(11) = 0.56

H(S^2) = 1.62
H(S)   = 1.62 / 2 = 0.81 bits

As this simple example shows, the per-symbol entropy of the nth extension
is equal to the zero-memory entropy if and only if the symbols are
independent.
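
For anyone who wants to check the arithmetic, here is a rough sketch (in
Python, my own illustration, using non-overlapping blocks of length n
exactly as in the hand calculations above) of the per-symbol entropy of the
nth extension, first for the repeating DNA sequence and then for an
independent binary source:

from collections import Counter
from math import log2

def extension_entropy_per_symbol(seq, n):
    # split into non-overlapping blocks of n symbols, estimate the block
    # probabilities, compute H(S^n) = sum p*log2(1/p), and divide by n
    blocks = [seq[i:i+n] for i in range(0, len(seq) - n + 1, n)]
    total = len(blocks)
    return sum((c/total) * log2(total/c)
               for c in Counter(blocks).values()) / n

dna = "AAAAATAAAA" * 4
for n in (1, 2, 5, 10):
    print(n, round(extension_entropy_per_symbol(dna, n), 2))
# prints 0.47, 0.36, 0.20, 0.00 -- the per-symbol entropy keeps dropping

import random
random.seed(0)
coin = "".join("1" if random.random() < 0.75 else "0" for _ in range(100000))
for n in (1, 2, 4):
    print(n, round(extension_entropy_per_symbol(coin, n), 2))
# stays at about 0.81 bits/symbol for every n, as independence requires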

So the result of the 10th extension that I gave proves that DNA is not a
zero-memory source.

**Note that I do not actually think it is a source at all, but rather a
channel; I was just showing the error in your analysis here.**

>>From this it would be concluded that the source is actually a markov
>>source not a zero memory source.
>
>
>A zero memory Markov chain is what a random sequence is.

There is no such thing as a zero-memory Markov chain. A random sequence is
a zero-memory source, but that has no relevance to the topic.

>>The general rule is that any information source that repeats a
>>sequence will have zero information. In plain english this is:
>>
>>"If we already know what is being sent then there is no information
>>being sent"
>>
>>btw this is the basic principle that all compression software relies on.
>>
>>So you can see that modeling DNA as an information source tells us
>>nothing at all.
>>
>>The method I would use to model DNA is as an information channel.
>>This is due to DNA being equivalent to a storage device for
>>information (exactly like a CD etc) It can be played as many times as
>>it likes but it is not _creating_ any information each time it is played.
>
>
>Maybe you should tell this to Hubert Yockey. But I don't think he would
>agree with you. By the way, DNA is not like a CD. There are mutations in
>DNA and they can add information.

I don't know who Yockey is, but this is what engineers the world over have
been taught about information theory since Shannon and Wiener invented it
in 1948.

Reasons why DNA is similar to a CD in terms of information theory:

1. It is a channel for encoded information.

2. It outputs a set sequence of codes repeatedly.

3. It can be replicated.

4. Random errors/mutations can occur in the replication process.

On what grounds do you object to this comparison?

If it is only that mutations (in your opinion) add information where errors
in CD replication do not, well, that is precisely the point being debated,
so it is obviously not a valid argument.

Do you have any objections actually based on how the two differ in relation
to information theory?

Is anyone interested in what a realistic analysis of the problem would show
if done correctly, with DNA treated as an information channel?

>>The mutations of DNA seem analogous to the errors encountered
>>when copying a CD which is quite easily modeled by a correct
>>application of information theory.
>>
>>By doing it this way it is possible to model the random mutations and
>>the effect they have on the information, ie the difference they make to
>>the information content as opposed to the actual information content.
>>The measure of this is called the mutual information of a channel.
>>
>>I hope this clears it up somewhat, it is quite difficult to explain this
>>in easy terms and I would recommend finding a good textbook if you
>>really want to pursue this.
>
>
>I was about to make the same recommendation to you.

Sorry Glenn but I know what I am talking about here.

Your posts give no positive evidence or details of how information can be
created by random mutations, except for your initial post, whose faults I
have demonstrated.

Furthermore, you have not refuted my calculations with anything but your
personal opinion, which does not seem to be based on any knowledge of the
material you are discussing.

You misunderstand the basics of information theory if you think that random
noise consists of, or can create, information in any form whatsoever. The
fact that the next symbol is unpredictable does not imply in any way that
random values are being produced. You should look into source coding to see
what is actually meant by the symbol probabilities of a source.

Information theory uses words such as 'random' in a very different way from
the layman's sense. For example, if you are sending English down your
modem, I would model that as a random source, but it is NOT random noise
that is being sent; it is English text, which any randomness would corrupt,
not enhance. In fact, as common sense suggests, any random modification of
the signal will ALWAYS reduce the information.
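
To put a number on that last claim, here is a rough sketch (in Python, my
own toy example; the source probabilities are arbitrary): a binary source
sent through a channel that randomly flips each symbol with probability e.
The mutual information -- how much the output still tells you about the
input -- only falls as the random corruption grows:

from math import log2

def entropy(*probs):
    return sum(p * log2(1/p) for p in probs if p > 0)

def mutual_information(p1, e):
    # X has P(X=1) = p1; Y is X with each symbol flipped with probability e
    py1 = p1 * (1 - e) + (1 - p1) * e              # P(Y=1)
    return entropy(py1, 1 - py1) - entropy(e, 1 - e)  # I(X;Y) = H(Y) - H(Y|X)

for e in (0.0, 0.05, 0.1, 0.25, 0.5):
    print(e, round(mutual_information(0.75, e), 3))
# 0.811 bits with no corruption, falling to 0.0 when the flips are pure noise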

The randomness is only what the person listening to the source sees. The
source itself (for example a person speaking) is not producing random
signals; it is just that the listener does not know exactly what the
speaker will say next, which is where the randomness comes in.

The more uncertain we are about what the person will say, the more
information is being produced. BUT the person speaking is NOT producing
random noise or anything close to it; they are just producing a wider range
of sounds (symbols).

The comparison with random errors/mutations is not valid because in that
case the data is not just unknown to the listener; it is gibberish to the
sender as well.

Some textbooks that deal with information theory are as follows:

Haykin, S., "Communication Systems", 3rd edition, John Wiley & Sons, 1994
van der Lubbe, J. C. A., "Information Theory", Cambridge University Press, 1997
Abramson, N., "Information Theory and Coding", McGraw-Hill, 1963

Also of interest is the IEEE Info Theory website at:
http://it.ucsd.edu/

Additional information on Markov sources can be found in any mathematics
book that deals with random processes (this is quite heavy going if you are
not up on your probability theory).

--------------------------------------------
Brad Jones
3rd Year BE(IT)
Electrical & Electronic Engineering
University of Western Australia