Re: Information: a very technical definition (was Dawkins'

Brian D Harper (bharper@postbox.acs.ohio-state.edu)
Tue, 16 Jun 1998 17:05:07 -0400

At 08:17 PM 6/15/98 -0500, Glenn wrote:

[...]

>>GM>If you start with a sequence of 11111111111 or in DNA AAAAAAAAAAA and
>>>then mutate it so that it is 11111211111 or in DNA AAAAATAAAAA you have
>>>actually increased the information.
>>

SJ:===
>>I have a son, Brad, who is studying Information Technology Engineering at
>>university and he is currently doing Information Theory. I showed him your
>>message and he said what you are saying here is "wrong" (his word). He gave
>>me a very detailed explanation why (which I did not fully understand), and
>>even if I did it sounded too complex for me to repeat!
>>

GM:==
>Brad better wait til he gets his degree. The equation I presented above
>shows why there is no information in a sequence like AAAAAAAAAA. There are
>no choices of characters so the probability of getting A is 100% or 1.
>Thus when you put that into the above equation
>
>H=-K (1) log(1).
>
>Well the log of 1 is ZERO, 0. This means that -K times 1 times zero is
>ZERO. NO information.
>
>In the case of AAAAATAAAA the probabilities of the characters are:
>
>P(A)=.9
>P(T)=.1
>
>Thus
>
>H=-K (.9log(.9)+.1log(.1))= -K(-.041-.1)=.14K.
>
>As long as K is not zero, which it isn't, the mutation Brad says
>doesn't increase information does exactly that. There is more information
>in the latter case than in the former.
>
>This should be enough for now.
>
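
Glenn's arithmetic is easy to check numerically. Here is a
minimal sketch (Python is my choice of tool, not anything from
the original posts). Note that .9 and .1 are the exact symbol
frequencies for the 10-character string in Glenn's last example;
for the 11-character AAAAATAAAAA they are 10/11 and 1/11, so H
comes out slightly smaller:

import math

def shannon_entropy(seq, K=1.0):
    # H = -K * sum of p*log10(p) over the symbol frequencies in seq
    n = len(seq)
    return -K * sum((seq.count(c) / n) * math.log10(seq.count(c) / n)
                    for c in set(seq))

print(shannon_entropy("AAAAAAAAAAA"))  # -0.0, i.e. zero: one symbol, no choice
print(shannon_entropy("AAAAATAAAA"))   # 0.141..., Glenn's .14K with K = 1
print(shannon_entropy("AAAAATAAAAA"))  # 0.132..., the 11-character version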

Information theory tends to get very confusing. One point
of possible confusion in recent discussions is the
simultaneous usage of two related but different theories
and definitions of information: classical information
theory a la Shannon and the more recent algorithmic
information theory of Kolmogorov and Chaitin.

If one is discussing information content in terms of
compressibility, then algorithmic information theory is
the appropriate framework. So, what I plan to do here is
reach the same conclusion Glenn does by a different route.
This example will hopefully illustrate an important point
that is often missed: the algorithmic information content
is not given directly by the percent compressibility. A
sequence that is 90% compressible may contain more
information than one that is only 10% compressible,
provided it is long enough.
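
To see this concretely, here is a minimal sketch (Python and
its standard zlib module are my choice of tools) comparing a
large, highly compressible file with a short, essentially
incompressible one:

import os
import zlib

# 1,100,000 bytes of a repeated 11-character pattern: highly compressible
big_repetitive = b"AAAAATAAAAA" * 100000
# 500 random bytes: essentially incompressible (the "compressed"
# version may even come out slightly larger than the original)
small_random = os.urandom(500)

for label, data in (("repetitive", big_repetitive), ("random", small_random)):
    c = len(zlib.compress(data, 6))   # level 6 is gzip's default
    print(f"{label}: {len(data)} -> {c} bytes "
          f"({100 * (1 - c / len(data)):.1f}% compressible)")

The repetitive file comes out more than 99% compressible, yet
its compressed size (a few kilobytes) is several times larger
than that of the 500 random bytes, which hardly compress at
all. On the algorithmic measure, it is the repetitive file
that carries more information.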

OK, what I did was follow up on Steve's example. First
I started with two files containing 640 repetitions of
the two sequences, hereinafter referred to as -1- and
-2-,

-1- => repetitions of AAAAAAAAAAA
-2- => repetitions of AAAAATAAAAA

I then began to make larger and larger files. My initial
plan was to double the size each time, but I quickly
exceeded the size that I could copy to the clipboard, so
the size doubling stops after a while.

Anyway, the results are in the following table. Compression
was accomplished using gzip with default settings.

======================Table========================
 original      compressed        compressibility    increase
 file size     size (bytes)           (%)           (bytes)
 (bytes)        -1-    -2-       -1-      -2-
---------------------------------------------------
    7360         74     83     98.995   98.872         9
   14720         94    102     99.361   99.307         8
   29440        136    144     99.538   99.511         8
   58880        222    230     99.623   99.609         8
  117760        394    401     99.665   99.659         7
  176640        565    572     99.680   99.676         7
  235520        736    744     99.688   99.684         8
  294400        907    914     99.692   99.690         7
  353280       1077   1086     99.695   99.693         9
  412160       1249   1258     99.697   99.695         9

** -1- => repetitions of AAAAAAAAAAA (all A's)
** -2- => repetitions of AAAAATAAAAA
==============================================

First note that as the file size increases, the %
compressibility also increases and approaches 100%,
though of course it will never actually get there.

The important point, though, is that the information
content is given not by the % compressibility but
rather by the size of the compressed file. Even
though the compressibility is approaching 100%,
the information content continues to increase
(see the two compressed-size columns).

It is also interesting to note that the difference
between the two compressed files (last column) remains
virtually constant, varying between 7 and 9 bytes.
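
For anyone who wants to reproduce the table, here is a
minimal sketch (Python with its standard zlib module is my
assumption; the original numbers came from the gzip command
line tool, so exact byte counts may differ by a header's
worth). The byte counts in the table are consistent with
four repetitions per line plus a CRLF, i.e. 46 bytes per
line; that layout is my inference, not something stated
above:

import zlib

def make_file(unit: str, reps: int) -> bytes:
    line = unit * 4 + "\r\n"             # 46 bytes: four repetitions + CRLF
    return (line * (reps // 4)).encode("ascii")

def compressed_size(data: bytes) -> int:
    return len(zlib.compress(data, 6))   # level 6 matches gzip's default

print("  size   -1-   -2-    %-1-    %-2-  diff")
for reps in (640, 1280, 2560, 5120, 10240):
    f1 = make_file("AAAAAAAAAAA", reps)  # -1-: all A's
    f2 = make_file("AAAAATAAAAA", reps)  # -2-: one T per repetition
    c1, c2 = compressed_size(f1), compressed_size(f2)
    print(f"{len(f1):>6} {c1:>5} {c2:>5} "
          f"{100 * (1 - c1 / len(f1)):7.3f} {100 * (1 - c2 / len(f2)):7.3f} "
          f"{c2 - c1:>4}")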

While I was at it, I decided to do another little
experiment. I took a single file with 168 repetitions
of AAAAATAAAAA and began to modify it according to
the following rule: I rolled a twenty-faced die and
then moved to the right by however many spaces the
roll indicated. I then changed whatever character
I happened to land on, A to T or T to A. After every
10 changes I saved the file and then looked at the
change in information content versus the number of
random changes. Here are the results:

original size = 1932 bytes

# of changes    compressed size (bytes)
============    =======================

      0                  56
     10                  84
     20                 113
     30                 123
     40                 145
     50                 156
     60                 169

This should show clearly enough that random changes
increase the information content, at least as measured
by compressed file size.
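
The die-rolling experiment is just as easy to sketch (again
in Python; random.randint stands in for the physical die,
and wrapping around at the end of the sequence is my
assumption, since what happens when a roll runs off the
end isn't specified above):

import random
import zlib

seq = list("AAAAATAAAAA" * 168)          # 168 repetitions, as above

def content(s):
    # information content as measured by compressed size
    return len(zlib.compress("".join(s).encode("ascii"), 6))

print(0, content(seq))
pos = 0
for change in range(1, 61):
    pos = (pos + random.randint(1, 20)) % len(seq)   # roll the die, move right
    seq[pos] = "T" if seq[pos] == "A" else "A"       # flip whatever we land on
    if change % 10 == 0:                             # save/compress every 10 changes
        print(change, content(seq))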

Brian Harper
Associate Professor
Applied Mechanics
The Ohio State University

"It appears to me that this author is asking
much less than what you are refusing to answer"
-- Galileo (as Simplicio in _The Dialogue_)