Re: Sequence redundancy in pr...

Brian D. Harper (bharper@postbox.acs.ohio-state.edu)
Tue, 5 Sep 1995 13:43:00 -0400

Abstract: In two previous posts I have dealt with (1) some theological
considerations involved with the probability argument and
(2) what I considered to be the fallacy of the argument from
improbability as well as a fallacy often committed by
evolutionists in answering this argument. In this post I
want to try to outline what I consider to be a real problem
with Glenn's recent posts on this subject.

First let me say that Glenn's analogy relating functional proteins
to meaningful phrases in English is useful in many ways. If stretched
too far, however, it ceases to be very realistic. One problem I
see is that English phrases may be too tolerant of changes, misspellings,
etc. For example, Glenn wrote:

Now, consider that misspellings do not always ruin the meaning.
I could say "If you pick you're nose..." "If you pick your
nostral..." the meaning is still conveyed albeit less efficiently.
Look at the posts on the reflector. There are lots of misspelled
words. I misspell "Occassionally" all the time. Tonight I saw Brian
Harper consistently misspell "falacy" (sorry Brian). It should be
"fallacy." But I fully understood what Brian was saying so the
misspelling didn't hurt communication. You understand me when I
misspell "occassionally" Thus the number of USEFUL strings are
many times more than the number of CORRECT strings. If you say that
a phrase can have up to 5 misspellings before you can not understand
it any longer than there are 1,650,000 sequences which convey the
message about nose picking.

Not only could one read falacy and understand that fallacy was intended,
one could do the same for all of the following:

falllacy, faalacy, falicy, faallaaccy, falllaci etc. etc.

The problem, then, is that the combination of English phrases with an
intelligent interpreter is probably much more flexible in retaining
function (meaning) than a protein sequence is. While it is true that
there is a lot of flexibility at some sites in a protein, some sites are
also invariant, i.e. one has to get one specific amino acid at those
sites in order to retain function. There are probably not too many
English phrases that have the property that a single mistake completely
destroys the meaning. It is also my understanding that a large proportion
of the information content of a protein comes from the invariant sites.

Secondly, I think the numbers were working somewhat in your favor with
the example "ieateggs" simply because the phrase is so short, i.e. it
has such a small information content.

Let's go back to picking our noses: :<~)

if you pick your nose; you get warts

The probability of generating this exact sentence is about 10^-50.
Here I have used 27 possible characters (26 letters + space) but
have omitted punctuation marks, which leaves 35 characters and
27^35 possible sequences. Of course, 33,000 ways of conveying
this message pale in comparison to the number 10^50. Even if you
could consider all of these possibilities, you only increase the
probability from 10^-50 to about 10^-46, which doesn't help much.
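
As a rough check of the arithmetic above, here is a minimal Python
sketch. It assumes every one of the 27 characters is equally likely at
each position and that the sentence is counted without its semicolon.
The function and variable names are mine, chosen for illustration only.

```python
import math

ALPHABET_SIZE = 27  # 26 letters + space; punctuation is ignored


def log10_sequence_count(text, alphabet_size=ALPHABET_SIZE):
    """Return log10 of the number of strings with the same length as text."""
    return len(text) * math.log10(alphabet_size)


# The sentence with the semicolon dropped, as in the post.
sentence = "if you pick your nose you get warts"

log_total = log10_sequence_count(sentence)
log_p_exact = -log_total                       # one target sequence
log_p_any = math.log10(33_000) - log_total     # 33,000 acceptable variants

print(f"length             = {len(sentence)}")
print(f"P(exact sentence)  ~ 10^{log_p_exact:.1f}")
print(f"P(any of 33,000)   ~ 10^{log_p_any:.1f}")
```

Working in log10 keeps the numbers readable. The 33,000 variants shift
the exponent by only about 4.5, which is the point being made.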

Another point, though, is that when you start to consider other
sequences that convey the same meaning, these sequences are for
the most part of different lengths, which changes the associated
probabilities. For example, here are several different ways of
saying the same thing. The total number of sequences of the same
length is given in [brackets] following each phrase.

a) if pick nose get warts [10^31]

b) To produce warts, pick your nose [10^44]

c) if you pick your nose you contract warts [10^57]

d) pulling snot out of your nose makes warts [10^58]

e) if you pick your nasal passages you cause warts [10^67]

f) if you pick your nostrils you get hypertrophy of the corium [10^84]

g) When a finger is put in a nostril the finger produces calloused
bumps [10^99]

h) When a digit is placed in a nostril the finger produces calloused
bumps [10^101]

i) When a digit is inserted into a nostril the finger produces calloused
bumps [10^107]

My point here is that your adding up all the different ways of producing
a particular meaning is not really fair. Suppose you start with phrase
(a). You now have 1 out of about 10^31. At this point you can't go to
(b) and claim you have 2 out of 10^31. You also have to consider the
other roughly 10^44 sequences of that particular length. So, finding
different ways of saying the same thing only helps if they are of
equal or shorter length than the original.
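
The bracketed counts and the pooling argument can be sketched in Python
as follows. This is only an illustration under the same assumptions as
the post (27 equally likely characters, punctuation dropped). Pooling is
modeled by simply adding the sequence counts for each length, and all
names are hypothetical.

```python
import math

ALPHABET_SIZE = 27  # 26 letters + space


def counted_length(phrase):
    """Number of characters that count: letters and spaces only."""
    return sum(1 for c in phrase if c.isalpha() or c == " ")


def log10_space(phrase):
    """log10 of the number of sequences with the phrase's length."""
    return counted_length(phrase) * math.log10(ALPHABET_SIZE)


phrases = {
    "a": "if pick nose get warts",
    "b": "To produce warts, pick your nose",
    "c": "if you pick your nose you contract warts",
    "d": "pulling snot out of your nose makes warts",
    "e": "if you pick your nasal passages you cause warts",
    "f": "if you pick your nostrils you get hypertrophy of the corium",
    "g": "When a finger is put in a nostril the finger produces calloused bumps",
    "h": "When a digit is placed in a nostril the finger produces calloused bumps",
    "i": "When a digit is inserted into a nostril the finger produces calloused bumps",
}

for label, phrase in phrases.items():
    print(f"{label}) length {counted_length(phrase):3d}  ~ 10^{log10_space(phrase):.1f}")

# Pooling (a) and (b): two targets, but the search space is now the sum
# of the spaces for both lengths, so the odds get worse, not better.
space_a = ALPHABET_SIZE ** counted_length(phrases["a"])
space_b = ALPHABET_SIZE ** counted_length(phrases["b"])
log_p_a_alone = -math.log10(space_a)
log_p_pooled = math.log10(2) - math.log10(space_a + space_b)

print(f"P(a alone)     ~ 10^{log_p_a_alone:.1f}")
print(f"P(a or b)      ~ 10^{log_p_pooled:.1f}")
```

The exponents agree with the bracketed figures to within rounding, and
the pooled probability for (a) or (b) comes out far smaller than the
probability for (a) alone.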

Now let's go back to proteins. First a brief comment about the
following statement:

Glenn:============================================================
The problem, mathematically as I see it, with all the probability
calculations is that we have not done a sufficient job of sampling
the functional space. We have assumed that 1 sequence = 1 function.
This may not be true. It may be that trillions of sequences = 1
function.
end:==============================================================

This is only a problem with some calculations. For example, in Terry's
example there were about 10^55 functional sequences and in Yockey's
there were about 10^93 functional sequences. That is many, many more
than the trillions you mention above, but it doesn't help much, at least
not for the case of a random search, since these numbers are
infinitesimal fractions of the total number possible.

In the same post you also wrote:

Glenn:============================================================
Earlier tonight I noted that cytochrome C comes in sequence lengths
from 103 to 112. This is somewhat similar to the sentences above.
They are coming in different lengths. So in order to calculate the
volume of a particular functionality, in sequence space, one needs
to know how many families of seqeunces can produce a given function
and the distribution of sequence lengths of those families.
end:===============================================================

[for sake of comparison with the above, Yockey's calculation was based
on a cytochrome c with sequence length of 110]

As noted above, Yockey found that there were 10^93 functional sequences
of length 110. According to what you wrote above there are 10 known
sequence lengths for cytochrome c. Let's suppose there are also 10^93
functional sequences for each of these lengths. This seems pretty
generous to me in view of your suggestion of trillions above; 10^93
is literally trillions of trillions ;-). But we note that we have only
increased the number of functional sequences by a factor of 10, so now
we have about 10^94. This is an unthinkably large number. I haven't
actually done the calculation :), but I doubt one could actually have
such a huge number of cytochrome c molecules even if all the matter
in the universe were converted into cytochrome c. So, we do have
a tremendous amount of flexibility here, but it is not enough to
entertain a random search mechanism since there are 10^137 possible
sequences of length 110. Now, as I argued above, it's not really
fair to compare 10^94 with 10^137; you should also add in the number
of possible sequences of length 103, 104, 105 .... But this addition
doesn't really change the conclusion. Without it, the probability of
finding any one of the 10^94 functional sequences is still only
10^-43.
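
For anyone who wants to vary the inputs, the estimate above reduces to
a few lines of Python. Both exponents (93 and 137) are taken from the
post as given rather than recomputed, and the constant names are mine.

```python
import math

LOG10_FUNCTIONAL_PER_LENGTH = 93  # functional sequences per length, as quoted
LOG10_TOTAL_LENGTH_110 = 137      # possible sequences of length 110, as quoted
N_LENGTHS = 10                    # known lengths 103 through 112

# Ten lengths with 10^93 functional sequences each.
log10_functional = LOG10_FUNCTIONAL_PER_LENGTH + math.log10(N_LENGTHS)

# Most favorable case: the denominator counts only length-110 sequences.
log10_p = log10_functional - LOG10_TOTAL_LENGTH_110

print(f"functional sequences ~ 10^{log10_functional:.0f}")
print(f"P(random hit)        ~ 10^{log10_p:.0f}")
```

Adding the sequences of the other lengths to the denominator can only
push the exponent further down.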

Glenn:===========================================================
To see this, assume for a moment that the sequence Leu-Ser is
capable of performing cytochrome's function and it is the shortest
sequence which can do that. In this case, if you randomly make
70-residual-long sequences, you would be amazed to find that one in
400 sequences have that combination. (If Gordie Simons thinks that
the duplicate occurrences of leu-ser in a 70-residual seqeunce lowers
this probability significantly, maybe he should comment.) Thus it
is not unlikely to find Leu-Ser contained in a 70 sequence. Cutting
the 70 unit sequence with two cuts at randomly chosen positions, gives
you a 1/5000 chance of cutting the leu-ser sequence. With the number
of proteins in a lab vat, this is a high probabiliy event. As
long as the probability for finding a given functionality is in the
range of 10^-14 to 10^-18, it is likely that given the volumes of
proteins which could be made, and the time frame we have, that a
given function could be found randomly. It might not find cytochrome C
but it might find an equivalent functional molecule.
end:=================================================================

This seems to me to be just wild speculation. Perhaps a biochemist or
molecular biologist could comment: what is the shortest possible length
of a protein that could perform some nontrivial function? Is it even
remotely conceivable that something as simple as Leu-Ser could perform
the function of cytochrome c?

But let's suppose this works. What happens next? This is, I think, a
real problem with the protein-first scenario. One has to go from this
very simple beginning to highly complex cells without the aid of
selection. This is tough going, to say the least.

But I think you are on the right track. The initial stage of any origin
of life scenario will likely be a random process. Based on what we have
said in this thread, the key to overcoming the improbability problem
is to start with short sequences. To greatly strengthen your argument you
should consider switching from proteins to short self-replicating
RNA sequences. This is the modern view.

One advantage of this is seen immediately. The number of basic building
blocks for RNA is considerably smaller than that for a protein. For this
reason, the total number of possible RNA sequences of some given length
will be significantly smaller (many orders smaller) than the number of
possible proteins of the same length. This will allow you to propose a
much longer initial self-replicator to arise purely by chance and still
stay within the roughly 10^-20 probability barrier that you mention above.
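
To put rough numbers on this, here is a small Python sketch. It assumes
4 equally likely nucleotides for RNA, 20 equally likely amino acids for
proteins, and exactly one target sequence at the chosen length. These
are simplifying assumptions of mine, not figures from the thread.

```python
import math


def max_length(alphabet_size, log10_barrier=20):
    """Longest length L such that alphabet_size**L <= 10**log10_barrier."""
    return math.floor(log10_barrier / math.log10(alphabet_size))


rna_max = max_length(4)        # 4 nucleotides
protein_max = max_length(20)   # 20 amino acids

print(f"longest RNA within 10^-20:     {rna_max} bases")
print(f"longest protein within 10^-20: {protein_max} residues")
```

Under these assumptions the barrier allows an RNA of 33 bases but a
protein of only 15 residues, so the RNA can be roughly twice as long.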

The other advantage to this view is that it is now easier to answer the
question "what happens next?" and start building complexity. With a simple
self-replicator as the initial step, one can start talking right away
about selection pressures etc., so that the rest of the road is no longer
purely or even predominantly chance.

Well, I don't want to be too optimistic here :). The one-two punch
argument of simple self-replicator + selection is considered by many
sufficient to end all debate. In fact it is not; there is a severe
problem encountered in the self-organizational stage following this
initial step. I'm not going to try to get into that here, as this post
is too long already. Perhaps I can later if people are interested. It
seems pretty clear to me that the resolution of this problem, if it is
resolved, will require fine tuning of several parameters, as I have
mentioned elsewhere.

==

Brian Harper:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=
"I believe there are 15,747,724,136,275,002,577,605,653,961,181,555,468,
044,717,914,527,116,709,366,231,425,076,185,631,031,296 protons in the
Universe and the same number of electrons." Arthur Stanley Eddington
:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=:=