Re: pure chance

billgr@cco.caltech.edu
Fri, 17 Jan 1997 16:53:14 -0800 (PST)

Brian Harper:

[straight to the meat... :-)]

> This takes us back to a question [i.e. "this is nice, but what good
> is it?"] that I have a hard time addressing since I know so little
> about molecular biology. I'll try anyway, of course ;-). Actually,
> I think this would be a great question to ask on bionet.info-theory,
> tactfully of course. Something like listing the top five contributions
> of information theory to biology.
>
> Here's a few stabs of mine, not necessarily five and not necessarily
> profoundly significant.
>
> 1). The first thing that comes to mind is the Shannon-McMillan
> theorem that I mentioned recently in another post. Concluding
> a section on this theorem in his book, Yockey writes:
>
> ===========
> Probability theory contains many theorems that are
> contrary to most people's intuition. Authors in
> molecular biology, almost without exception, are
> unaware of the Shannon-McMillan theorem, and have
> been led to false conclusions. They do not realize
> that, by the same token as the sequences in the
> dice throws, all DNA, mRNA and protein sequences
> are in the high probability group and are a very
> tiny fraction of the total possible number of such
> sequences.
> -- Yockey, _Information Theory and Molecular Biology_,
> Cambridge University Press, p. 76.
> =============
>
> What a teaser; I wish he would elaborate more on the false
> conclusions that were reached.

Ah, ha!!!!! Got you now... :-) I've been thinking about it, and
this is an excellent example of what we were talking about with
varying the bases. (Of course, varying just a couple wouldn't qualify.)
The problem is that if you vary a few bases, you might get yourself
over the edge of the 'usual group' (what info theory calls the
'typical sequences') into no-man's-land, which is a lot bigger than
the small number of 'typical sequences.' It is exactly this fact
that makes error-free transmission possible in the first place
(the noisy-channel coding theorem; Shannon's second theorem, I
think.)
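
Here's a quick numerical illustration of that, by the way. The
biased-coin source with p = 0.9, and the cutoff eps, are just
illustrative choices of mine, nothing from Yockey:

# Shannon-McMillan in miniature: for a biased i.i.d. source, the
# "typical" sequences are a vanishing fraction of all possible
# sequences, yet they carry nearly all of the probability.
from math import comb, log2

p, n, eps = 0.9, 1000, 0.05
H = -p * log2(p) - (1 - p) * log2(1 - p)   # entropy, bits/symbol

typical_count, typical_prob = 0, 0.0
for k in range(n + 1):                     # k = number of 1s
    # per-symbol "surprise" of any sequence containing k ones
    surprise = -(k * log2(p) + (n - k) * log2(1 - p)) / n
    if abs(surprise - H) <= eps:           # the AEP notion of "typical"
        typical_count += comb(n, k)
        typical_prob += comb(n, k) * p**k * (1 - p)**(n - k)

print(f"H = {H:.3f} bits/symbol")
print(f"typical set: ~2^{log2(typical_count):.0f} of the 2^{n} sequences")
print(f"fraction of sequence space: {typical_count / 2**n:.1e}")
print(f"probability carried by the typical set: {typical_prob:.2f}")

The typical set comes out astronomically large in absolute terms,
yet a vanishingly small fraction of the 2^1000 possibilities, while
carrying about 90% of the probability (a fraction that goes to 1 as
n grows).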

Of course, I'm not sure that anyone knows what the typical DNA
sequence looks like, and it probably varies with time (what would
you call that, anyway? :-) :-) :-)). So, as always, you have to
delimit the ensemble. If you delimit it as the typical set, which
is safe, then once you deviate from the typical set you have NO
IDEA where you are, and you need basically the entire DNA string
to let you know. If you are
inside the typical set, on the other hand, you are dealing with
such an infinitesimal set of the possible sequences that you
can declare your ensemble a lot smaller, and so have less information
in the genome.
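
To put rough numbers on that last point (the base frequencies below
are made up, just to have a skewed source):

# "Delimiting the ensemble": with a 4-letter alphabet a raw genome
# description costs 2 bits/base, but an index into the typical set
# of a skewed source costs only H bits/base.
from math import log2

freqs = {'A': 0.35, 'T': 0.35, 'G': 0.15, 'C': 0.15}  # hypothetical
H = -sum(p * log2(p) for p in freqs.values())         # bits per base
n = 1_000_000                                         # genome length

print(f"H = {H:.3f} bits/base (vs 2.0 for the uniform ensemble)")
print(f"index into the typical set: ~{n * H / 8 / 1e3:.0f} kB")
print(f"raw string (all you have outside it): {n * 2 / 8 / 1e3:.0f} kB")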


> 2). Another item that keeps coming up is the Central Dogma.
> Let's see if we can take a trip back in time to the days when
> the structure of DNA was being unraveled and the genetic code
> was being deciphered. Could information and
> coding theory have played a role at this point? We're just imagining,
> so never mind that this was also roughly the time period when
> Shannon was making his monumental contributions to the
> subject. Well, suspecting that information was being transferred
> from DNA to protein and knowing there are 61 codons and 20
> amino acids one could have predicted that there should be a
> central dogma. I think a prediction like this, before the fact,
> would have been monumental. Yockey mentions that Crick did
> predict from the 61-to-20 mapping that the code was redundant
> before this was actually determined experimentally.

Yep. A horribly missed opportunity! :-) (Just like Einstein and
the Big Bang...) It is so obvious in retrospect.
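
Actually, the counting part is easy to make concrete. A minimal
sketch (the table below is just the standard genetic code, written
in the usual compact TCAG ordering; '*' marks the stop codons):

# 61 sense codons onto 20 amino acids: the map can't be one-to-one,
# so the code must be redundant, and a protein can't be decoded back
# into a unique mRNA.
from collections import Counter
from itertools import product

bases = "TCAG"
aas = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
code = {b1 + b2 + b3: aa
        for (b1, b2, b3), aa in zip(product(bases, repeat=3), aas)}

sense = {c: aa for c, aa in code.items() if aa != '*'}
per_aa = Counter(sense.values())

print(f"{len(sense)} sense codons -> {len(per_aa)} amino acids")
print("codons per amino acid:",
      sorted(per_aa.items(), key=lambda kv: -kv[1]))

That 61 > 20 is the whole prediction: some amino acids must share
codons (Crick's redundancy), and since the map can't be inverted,
information can't flow back from protein to nucleic acid.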

> 3). Your mention of mutual information reminded me of something
> Yockey harped on a lot in his book, namely he wants to get
> rid of the vague and meaningless measure "similarity" [as in chimps
> are 99.6% similar to humans] and replace it with mutual entropy.
> Here's a quote from his book to illustrate:
>
> =======================
> A distinguished group of molecular biologists (Jukes & Bhushan,
> 1986; Reeck _et al_., 1987; Lewin, 1987) has called attention to
> sloppy terminology in the misuse of 'homology' and 'similarity'.
> Nevertheless, editors still permit authors to qualify 'homology'
> and to confuse that word with similarity. _Mutual entropy_ is the
> correct and robust concept and measure of similarity so that the
> sooner _per cent identity_ disappears from usage the better. Mutual
> entropy is a mathematical idea that reflects the intuitive feeling
> that there is a quantity which we may call information content
> in homologous protein sequences. Clearly, the shortest message
> which describes at least one member of the family of sequences is
> what one would properly call the information content.
> -- Hubert Yockey, _Information Theory and Molecular Biology_,
> Cambridge University Press, 1992, p. 337.
> =========================

Well, I'll just have to get the book, as you keep hinting. :-)
It is checked out right now, so it might be a while.
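
In the meantime, here's a toy contrast between per cent identity
and mutual entropy (mutual information, in textbook terms). The
sequences are made up: seq2 is seq1 with every base substituted
deterministically, so the two never agree at any site.

# Per cent identity vs. mutual information for two aligned sequences.
from collections import Counter
from math import log2

def entropy(counts):
    n = sum(counts.values())
    return -sum(c / n * log2(c / n) for c in counts.values())

def mutual_information(s, t):
    # I(X;Y) = H(X) + H(Y) - H(X,Y), from empirical frequencies
    return (entropy(Counter(s)) + entropy(Counter(t))
            - entropy(Counter(zip(s, t))))

seq1 = "ATGCGGATCCTAGCTAACGT" * 5
seq2 = seq1.translate(str.maketrans("ACGT", "GTAC"))

identity = sum(a == b for a, b in zip(seq1, seq2)) / len(seq1)
print(f"per cent identity: {identity:.0%}")
print(f"mutual information: {mutual_information(seq1, seq2):.2f} bits/site")

Per cent identity calls the sequences totally unrelated (0%), while
mutual information says each one completely determines the other
(2 bits/site, the maximum for a 4-letter alphabet). That's exactly
the kind of case where Yockey's measure earns its keep.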

> 4) In his book Yockey uses coding theory to show how the genetic
> code could have evolved by a stochastic process (random walk).
> Part of this was the transition from a doublet to a triplet code.
> I don't want to go into many details (since I don't understand
> them :). One of his concluding remarks about this in the
> _Epilogue_ is, I think, important wrt the origins debate, so
> I'll quote that:
>
> ==============
> "...The argument led to support of the endosymbiotic theory
> without being so contrived. The theory shows that the
> number of triplet codes that merit consideration is limited
> to the number of codes in the last two steps in the evolution
> of the doublet codes. The modern triplet genetic codes emerged
> from the bottleneck between the first extension doublet codes
> and the second extension triplet codes. It was by this means
> that the several modern triplet codes evolved without the
> necessity of trying the vast number of possible triplet codes."
> Yockey, ibid. p. 338.
> ===============

Interesting. Can some of the molbio people enlighten us as to
the doublet/triplet code ideas? I am reading the textbook, and
it is really interesting. I think I can hold my own with 5' and
3' ends and tRNA, mRNA, rRNA, replication forks, introns, and
maybe a few more of the basics. :-)

> This reminded me of a basic tenet in many creationist probability
> arguments. That random searches just aren't effective (in finite
> time) because there are just too many possibilities to search.
> Yockey concluded that one can make the transition from doublet
> codes to the modern triplet code without searching all the
> possibilities.

Well, I'll just have to get the book. That's all there is to it.
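
Until then, one crude way to put a number on "too many
possibilities" (this is just the naive count, not Yockey's actual
calculation):

# If each of the 64 codons could be assigned any of 21 meanings
# (20 amino acids + stop), the space of conceivable triplet codes
# is 21**64.
from math import log10

codes = 21 ** 64
print(f"naive triplet-code space: ~10^{log10(codes):.0f}")

trials_per_sec = 1e20        # hypothetical, absurdly generous rate
age_of_universe = 4.3e17     # ~13.8 billion years, in seconds
print(f"trials since the big bang: "
      f"~10^{log10(trials_per_sec * age_of_universe):.0f}")

So even an absurdly fast blind search covers something like 1 part
in 10^47 of the naive code space; Yockey's point is that the
doublet-to-triplet bottleneck means that space never had to be
searched.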

-Greg