Re: Probability and apologetics

Brian D. Harper (bharper@postbox.acs.ohio-state.edu)
Wed, 6 Sep 1995 13:55:34 -0400

Abstract: The ability to write an abstract shows that most English
(German, French whatever) documents are highly compressible
and thus unlikely to be the result of a monkey typing
randomly.

|(O) \
==> ||
|(0) /

Bill wrote:

first quoting me================
>>Suppose you tossed a fair coin 64 times, for each throw recording
>>a 1 for heads and a 0 for tails. Would either of the following two
>>sequences be more surprising?
>>
>>(A) 1010101010101010101010101010101010101010101010101010101010101010
>>
>>(B) 1110010001010001110000000000011101110101101110100101000111111010
>>

[...]

>Random processes are
>expected to give random results. Very very occasionally they may not,
>but this is so improbable that it can be discounted. This is well
>known and can be proven, but I'm going to skip those details here.
>
>I dare say that very few natural laws would have ever been discovered
>if scientists actually employed the fallacy mentioned above in practice.
end of brian:================

now bill:================================================================
The article in New Scientist which I recommended to the group the other day
includes the following:

"Our uneasy attitude toward randomness is probably to do with the human
penchant for spotting patterns. The brain has an architecture ideal for
picking out a person in a crowd, or linking together disparate events --
abilities that have obvious evolutionary advantages [I'm not quoting this
because I agree -- only to establish context - weh]. 'Humans want
order--and they will impose order even when it is not there,' says
Professor Norman Ginsburg, emeritus professor of psychology at McMaster
University, Ontario, who has made a study of how well humans simulate
randomness.
end bill:================================================================

I am very intrigued by this article and plan to go over to the library
and get it as soon as I get the chance. Random means so many different
things to different people that it becomes very difficult to even use
the word without confusion. I generally use the information theoretic
definition for a random sequence [more on this later] in which case
Ginsburg's comment seemed complete nonsense to me at first reading.
In this sense humans cannot impose order when it is not there no matter
how hard they might try. The rest of the quote shows that this is not
what Ginsburg actually meant [what he means is that humans might imagine
that they see order that isn't really there]. This is no doubt true.
For example, Democrats might imagine that there is order in the White
House. Oooh, I really shouldn't have said that ;-).

Perhaps a case in point here is my own sequence (B) above. I don't know
if I actually mentioned that this sequence was generated by tossing a
coin but I'm sure Bill would have concluded that it was based on what
he learned from this article. Note that this sequence contains 11
consecutive 0's at one point. I would *never* have done this had I been
deliberately trying to construct a random sequence!
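
Just for fun, here's a quick simulation (a rough sketch in Python, nothing
more) of how long a run of identical outcomes typically turns up somewhere
in 64 honest tosses:

import random

def longest_run(bits):
    # length of the longest run of identical symbols in the list
    best = run = 1
    for a, b in zip(bits, bits[1:]):
        run = run + 1 if a == b else 1
        best = max(best, run)
    return best

trials = 100000
runs = [longest_run([random.randint(0, 1) for _ in range(64)])
        for _ in range(trials)]
print("average longest run:", sum(runs) / trials)
print("fraction with a run of 6 or more:", sum(r >= 6 for r in runs) / trials)

Runs of a length most people would instinctively avoid when writing a
"random" sequence by hand show up with no help at all.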

Here's a humorous anecdote about "Mr Information" himself [Claude Shannon]
that illustrates some of these same points:

Shannon's Outguessing Machine

Not all games of chance are fair, perhaps least of all those which
(who) proclaim fairness the loudest ("I am not a crook"). But some
games make no pretensions of fair play; in fact, <un>fairness is
their very reason for being--such as Claude Shannon's engaging
"outguessing machine" [Sha 53].

Shannon's entrapping contraption initially makes random heads-tails
choices against a human contender. But once the machine has
experienced its first win, it begins to analyze the opponent's
"strategy" to a depth of two throws. Does he or she change after
losing a throw? Does the player keep on choosing tails if tails has
brought two previous wins? Does the gambler get chary and head for
heads next? For most people such strategies are mostly subconscious,
but the machine assumes the human to act like a second order Markov
process and uncovers the underlying transition probabilities without
fail. Exploiting these, the machine always wins over the long haul,
except against its creator. Shannon, keeping track of his machine's
inner states, can beat it 6 times out of 10. Of course, anyone could
win 5 out of 10 throws on average by playing random (perhaps by
flipping a true coin). But this is precisely what people, deprived
of proper props, are incapable of doing, as Shannon's machine has
demonstrated again and again by beating a wide variety of human
would-be winners. Specifically, man's mind seems to abhor long
strings of like outcomes--as occur perfectly naturally in truly
random sequences.

Of course, the machine can have bad luck too, especially in its
initial guessing phase. I once wanted to show off the machine's
prowess to a foreign friend (mathematician Fritz Hirzebruch) visiting
Bell Laboratories. As luck would have it, Hirzebruch won 13 times
in a row before his first loss. But thereafter the machine took off
with a vengeance, overtaking the renowned mathematician on throw 31
(i.e., the machine won 16 out of the next 18 throws!) and never fell
behind again--in spite of the fact that Hirzebruch had been told
(in general terms) how the machine worked.
--M. Schroeder, <Fractals, Chaos, Power Laws: Minutes from an Infinite
Paradise>, W. H. Freeman, 1991.

I have been intending for some time to look this article up and see if I
could write some code for the outguessing machine. Sounds like a fun game
for parties. [yes I know, parties for computer geeks and nerds ;-)]
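
In the meantime, here's a rough guess (in Python) at how such a machine
might be coded. I haven't seen [Sha 53], so the state space, the
play-randomly-until-there's-data rule, and the toy opponent below are all
assumptions based on Schroeder's description, not Shannon's actual design:

import random
from collections import defaultdict

class Outguesser:
    def __init__(self):
        # counts[state][will_change]: how often the player changed (or not)
        # on the throw immediately following `state`
        self.counts = defaultdict(lambda: defaultdict(int))
        self.history = []            # list of (player_choice, player_won)

    def state(self):
        # State = the player's last two throws, each summarized as
        # (won?, changed choice?) -- the "depth of two throws" in the quote.
        if len(self.history) < 3:
            return None
        (c1, w1), (c2, w2), (c3, w3) = self.history[-3:]
        return (w2, c1 != c2, w3, c2 != c3)

    def guess(self):
        s = self.state()
        if s is None or not self.counts[s]:
            return random.choice("HT")       # no data yet: guess at random
        last = self.history[-1][0]
        other = "T" if last == "H" else "H"
        # predict whether the player will change, based on what they have
        # done before in this state, then guess accordingly
        will_change = self.counts[s][True] > self.counts[s][False]
        return other if will_change else last

    def record(self, player_choice, machine_guess):
        s = self.state()
        if s is not None:
            changed = (player_choice != self.history[-1][0])
            self.counts[s][changed] += 1
        player_won = (player_choice != machine_guess)
        self.history.append((player_choice, player_won))

# Pit it against a toy "human" who repeats the previous call 70% of the time.
machine, wins, prev = Outguesser(), 0, "H"
for _ in range(200):
    g = machine.guess()
    call = prev if random.random() < 0.7 else ("T" if prev == "H" else "H")
    wins += (call == g)
    machine.record(call, g)
    prev = call
print("machine matched the player on", wins, "of 200 throws")

Against a genuinely random opponent this machine does no better than even,
which is exactly the point of the anecdote.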

=====:bill again, continuing to quote from New Scientist:================
This love of order can be a severe handicap when you're dealing with random
phenomena. Take the apparently simple task of writing down a sequence of
100 'random numbers'. In research published in the journal _Perceptual and
Motor Skills_, Ginsburg reported that volunteers had serious problems
making their numbers genuinely random. In particular they tended to avoid
repeating numbers and having sequences like 15, 16, 17. They also disliked
using numbers again until all the others had been 'given a go'.

But true randomness has no memory of what went before, and it is entirely
possible for small samples of random numbers to show fleeting bursts of
apparent order. Where people go wrong, it seems, is in thinking that such
snapshots of randomness have the characteristic lack of order that becomes
apparent only in the very long run... Statistics from Britain's National
Lottery support this: in those weeks for which numbers drawn contain a
consecutive pair, most had no jackpot winner. By contrast, in one draw
where the numbers were separated from each other by at least three other
numbers, no fewer than 133 people shared the jackpot"
==end New Scientist quote:===========

bill:========
"The point is that no sequence is inherently random or nonrandom.
The sequence Brian included could have come from tosses of a fair
coin."
end:=========

Now I'm really intrigued. Is this your interpretation of what the author
wrote or did he actually say something like this? In any event I must
strongly disagree with the first sentence: practically all sequences are
random. The second sentence I agree with; sequence (A) *could* have
been generated by tossing a fair coin, although this possibility is
extremely unlikely. This is not due to the improbability of obtaining
that *specific* sequence [the creationist's fallacy] but rather due to
the improbability of getting *any* ordered sequence from a random
process [failing to recognize this is what I referred to earlier
as the evolutionist's fallacy].

I think the confusion here is due to my using random in two different
ways in the same post. I do this out of habit since this terminology
is common in the literature. Anyway, the random in random sequence means
something different than the random in random process. I think it's
clear that this must be the case since it is in fact possible for
a random process to produce a non-random sequence, an apparent
contradiction. Random sequence takes on the definition of random
used in information theory. Whether a sequence is random or not
depends solely on its structure and is completely independent of
the process used to produce the sequence. Thus, one can discuss
the orderliness or randomness of a sequence knowing absolutely
nothing about how it was generated. To avoid this confusion, many
authors have started substituting stochastic process for random
process, reserving the word random for its information theoretic
definition.

I think Murray Gell-Mann explains this word confusion quite
well in the following:

As we have been using the word, applied for instance to a single
string of a thousand bits, random means that the string is
incompressible. In other words, it is so irregular that no way
can be found to express it in shorter form. A second meaning,
however, is that it has been generated by a random process,
that is, by a chance process such as a coin toss, where each head
gives 1 and each tail 0. Now those two meanings are not exactly
the same. A sequence of a thousand coin tosses _could_ produce a
string of a thousand heads, represented as a bit string of a
thousand 1s, which is as far from being a random bit string as it
is possible to get. Of course, a sequence of all heads is not at
all probable. In fact, its chance of turning up is only one in a
very large number of about three hundred digits. Since most long
strings of bits are incompressible (random) or nearly so, many sets
of a thousand tosses will lead to random bit strings, but not all.
One way to avoid confusion would be to refer to chance processes as
"stochastic" rather than random, reserving the latter term mainly
for incompressible strings.
-- Gell-Mann, M. (1994). <The Quark and the Jaguar>. New York:
W. H. Freeman and Company.
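
A toy illustration of the two meanings: generate a 1000-bit string two ways
and see how far an ordinary file compressor can shrink it. zlib is of course
only a crude stand-in for the ideal compressor of information theory, so the
byte counts are just upper bounds, but the contrast is the point:

import random, zlib

ordered = "1" * 1000                                         # a thousand heads
tossed = "".join(random.choice("01") for _ in range(1000))   # coin tosses

for name, s in [("all heads", ordered), ("coin tosses", tossed)]:
    packed = zlib.compress(s.encode(), 9)
    print(name, ":", len(s), "characters ->", len(packed), "bytes compressed")

The all-heads string collapses to a handful of bytes while the tossed string
refuses to shrink much below its information content, even though both
could, in principle, have come from the same stochastic process.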

bill:==================================================================
The article points out that these apparent patterns are typically "fleeting"
and that the lack of order typical of randomness typically requires long
strings of numbers. But how long is long? 100? 1000? 10^19? Depending
on what tests you use and how persnickety you are, your mileage will vary.
==========================================================================

Does the author elaborate on this point? How long is long is the key both
here and in my discussion with Glenn. The longer the sequence the stronger
my argument becomes. This point is more critical with a binary string
than it is with a protein since there are only two possibilities at
each location of a binary string, i.e. it is much more probable to
get 10 consecutive heads than it is to get 10 consecutive alanines.

Gell-Mann uses a length of 1000 in the example above; does it really
need to be this long? I don't think so. My string of length 64 has
exactly the features we have been discussing. There are substrings of
11 consecutive 0's and 6 consecutive 1's yet there is no obvious
pattern to be found. Also, the total number of 1's is nearly equal to
the total number of 0's.

I have more to say on "how long is long", but first I want to discuss an
interesting sidelight that has some importance to the discussion.
It is well known [look in practically any book on information theory]
that practically every sequence is random (incompressible) or nearly
so. Even so, one cannot actually prove that any *specific* sequence is
random. An interesting paradox: almost all sequences are random, yet
no specific sequence can be proven to be random. Gregory Chaitin
showed that this little oddity is closely related to Godel's theorem and
Turing's halting problem. This seems to throw a real monkey wrench into
the works. Given this, how is it possible to say *anything* about a
given sequence? Well, it's not as bad as it seems. It is impossible to
prove any given sequence is incompressible; however, it *is* possible
to prove a sequence is compressible merely by compressing it. For example,
the nose-picking phrases I gave in my last post are compressible to
"if pick nose get warts". Since most sequences are incompressible,
the ability to compress these phrases is itself sufficient reason to
say that it is very unlikely that these phrases were generated by a
stochastic process.

This brings us back to "how long is long". We know that practically
all sequences are incompressible or nearly so. This doesn't really
get at your question directly though. Given our distinction between
random sequence and random process, hereafter stochastic process for
sake of clarity, I think the question of interest is "how long of a
sequence do you need to determine (with reasonable certainty) whether a
sequence was generated by a stochastic process?"

Some time ago someone on bionet.info-theory suggested a way of looking
at this that I have found very useful. Returning to my two sequences
above, imagine that you are receiving each sequence from some unknown
source, one element at a time, going from left to right. At what point
can one use the information at hand to predict the rest of the sequence?
This is an indication of the compressibility of the sequence. For case
(A), the entire sequence can be predicted from the first two elements,
10, indicating that this sequence is highly compressible. For case (B)
there is no point at which one could predict the rest of the sequence.
But remember that compressibility itself does not determine whether
the sequence is produced from a stochastic process. I have to see much
more than just 10 before I can be reasonably certain that the process
generating sequence (A) is not stochastic. After 101010 I might be vaguely
suspicious and my confidence would mount with each successive pair of
10's. So, we arrive back at "how long is long". This question can be
made more precise at this point by looking at degree of compressibility.
For example, consider the following strings:

01010101
01010101010101
010101010101010101010101
0101010101010101010101010101010101010101

Each of these sequences is predictable from its first two elements;
however, I would be much more confident in saying that the last one
was not produced by a stochastic process than the first, since it
is much longer. But I can say the same thing more precisely by noting
that the last sequence is more compressible than the first. Roughly
speaking, what I mean by this is that the ratio of the length of
the repeating element to the total length is much smaller for the
last case.
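
Here is a short sketch of that rough measure (my own ad hoc notion, not a
standard one): the length of the shortest repeating unit divided by the
total length, so that smaller numbers mean "more compressible":

def shortest_period(s):
    # length of the shortest unit that, repeated, reproduces the string
    for p in range(1, len(s) + 1):
        if all(s[i] == s[i % p] for i in range(len(s))):
            return p

for s in ["01010101",
          "01010101010101",
          "010101010101010101010101",
          "0101010101010101010101010101010101010101"]:
    p = shortest_period(s)
    print("length", len(s), " repeating unit", p, " ratio", round(p / len(s), 3))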

We can now make this idea of degree of compressibility more precise
by introducing the notion of c-incompressibility. To avoid confusion
with this word note that it's c-*in*compressibility.

A sequence is said to be c-incompressible if its compressed
length is greater than or equal to n-c where n is the length
of the uncompressed string. In other words, if the length
of the string is 64 (as in the examples above) then all strings that
can be compressed to a length of 60 or larger would be
4-incompressible. Note this is *all* such strings, so that, for example,
a 4-incompressible string would automatically also be 5-incompressible, etc.
Here's the interesting result: at least half of all the strings
are 1-incompressible, at least 3/4 are 2-incompressible,
at least 7/8 are 3-incompressible, etc. Generalizing this,
we can say that at least [1 - (1/2)^c] of all the 2^n possible
sequences of length n are c-incompressible. Just the first of these
results is significant in our present discussion, i.e.
for n = 100 we have at least a fifty-fifty chance of getting a string
that can be compressed *at most* to a length of 99.
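
[For the record, the counting argument behind this result is short: a
compressed version of a string is itself a binary string, and there are only

    1 + 2 + 4 + ... + 2^(n-c-1)  =  2^(n-c) - 1

binary strings shorter than n-c bits. So at most 2^(n-c) - 1 of the 2^n
strings of length n can possibly be compressed below n-c bits, and the
fraction that fail to be c-incompressible is less than 2^(n-c)/2^n = (1/2)^c.
Hence at least 1 - (1/2)^c of them are c-incompressible.]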

I think it's obvious that the "amount of compressibility", as
we usually think of it, depends on both c and the length of
the string n. So, I am going to define the % compressibility
as follows: %C = [c/n]*100. To illustrate, for n = 100,
a 10-incompressible string can be compressed by no more than 10% of
its original length. Now, using the ideas in the previous
paragraph we see that about 99.9% of all possible strings cannot
be compressed by more than 10%. Now consider the case with n = 64.
In this case about 99.9% of all strings cannot be compressed by more
than 16%. Now, ordered generally means compressible, and we can
speak of a highly ordered sequence as being highly compressible,
i.e. as being compressible by some amount c close to n. We could also
talk about this in terms of %C as defined above, i.e. highly
ordered might mean something like 95% compressible while ordered
might be something in the range between say 75% and 95% compressible.

The following table gives the probability of obtaining *any*
string that is either 50%, 75%, or 95% compressible as a function
of the length of the string.

length n        50%          75%          95%
--------     ----------   ----------   ----------
      20     1(10)^-3     3(10)^-5     2(10)^-6
      40     9(10)^-7     9(10)^-10    4(10)^-12
      60     9(10)^-10    3(10)^-14    7(10)^-18
     100     9(10)^-16    3(10)^-23    3(10)^-29
     200     8(10)^-31    7(10)^-46    6(10)^-58
     500     6(10)^-76    1(10)^-113   1(10)^-143
    1000     3(10)^-151   2(10)^-226   1(10)^-286
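
For anyone who wants to check these, each entry is just the (1/2)^c bound
from the previous paragraphs with c = (%C/100)*n, so a few lines of Python
will regenerate the table (floating point is fine here; even the n = 1000,
95% entry is still within range of a double):

for n in [20, 40, 60, 100, 200, 500, 1000]:
    row = []
    for pct in [50, 75, 95]:
        c = round(n * pct / 100)          # bits of compression demanded
        row.append("%.0e" % 2.0 ** (-c))  # probability bound (1/2)^c
    print(n, row)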