RE: ABCDEFGHIJKMNOPQRSTUVWXYZ

Brian D Harper (bharper@postbox.acs.ohio-state.edu)
Wed, 31 Dec 1997 00:07:16 -0500

At 12:09 PM 12/30/97 -0600, Glenn wrote:
>At 12:17 AM 12/30/97 -0500, Brian D Harper wrote:
>>Hmmm... since John has agreed to say no more and Glenn
>>has promised to shut up [and who says miracles can't
>>happen :)] it seems it may be safe for me to speculate
>>wildly with no fear of rebuttal [fat chance].
>
>Unfortunately no miracle has occurred. I think I only made that promise to
>John. :-)
>

Weasel !!!!!! ;-0

BH:
>> It is with
>>this in mind that I mentioned spotting ABCDEF as an
>>orderly pattern, i.e. I was assuming that the alphabet
>>was part of the "hardware" and not part of the algorithm
>>itself. Otherwise, Glenn is correct.
>

Glenn:
>I can agree with this. But I have a question: if one uses statistics of the
>alphabet as part of the "hardware" (or compression software), do you really
>get a good measure of the complexity? After all, you have moved some of the
>information out of the sequence and into the software.
>

This all depends on the application, I think.

Let's consider the typical probability argument: what's
the probability of generating "Methinks it is like a
weasel" from a random process? As we all know, the
probability is small. Actually, though, we don't even
have to consider a specific phrase here. Since English
is typically 50% or more compressible, we could argue
that the probability is exceedingly small of producing
*any* phrase that even satisfies the statistical
properties of English, let alone one that is intelligible.
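
Just to put a number on "small", here is a quick back-of-the-envelope
calculation in Python. The 27-symbol alphabet (26 letters plus a space
key) is my assumption:

import math

phrase = "methinks it is like a weasel"     # 28 characters
alphabet_size = 27                          # 26 letters + space (assumed)

# Probability of hitting exactly this phrase with uniform, independent keystrokes
p = (1.0 / alphabet_size) ** len(phrase)
print(p)                                       # roughly 8e-41
print(len(phrase) * math.log2(alphabet_size))  # about 133 bits of "surprise"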

Here we run into what seems to be an almost universal blind
spot when it comes to probabilities: the assumption that the
various possible outcomes occur with equal probability.
This almost never happens except in games of chance.

If we had, for example, a stochastic process which generated
letters according to the statistical laws of English, then
the typical result would not be "random" in the AIT sense
but roughly 50% compressible. In this case the likelihood of
finding an output with "meaning" would improve by many orders
of magnitude.
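
A rough way to see this on a computer: generate one stream of letters
with equal probabilities and one with English-like letter frequencies,
and hand both to an off-the-shelf compressor. A sketch (the frequency
table is only approximate, and zlib is standing in for an ideal
compressor):

import random, zlib

# Approximate English single-letter frequencies (percent) -- illustrative numbers
freqs = {'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7,
         's': 6.3, 'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'u': 2.8,
         'c': 2.8, 'm': 2.4, 'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0,
         'p': 1.9, 'b': 1.5, 'v': 1.0, 'k': 0.8, 'j': 0.15, 'x': 0.15,
         'q': 0.1, 'z': 0.07}

n = 100000
letters = list(freqs)

uniform = ''.join(random.choices(letters, k=n))
weighted = ''.join(random.choices(letters, weights=list(freqs.values()), k=n))

for name, text in [("uniform", uniform), ("english-like", weighted)]:
    ratio = len(zlib.compress(text.encode(), 9)) / len(text)
    print(name, round(ratio, 2))

The frequency-weighted stream compresses better, since the compressor
can spend fewer bits on the common letters. Real English does much
better still (the 50% figure) because of correlations between letters,
which this single-letter toy model leaves out.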

Now consider the infamous typing monkey. Everyone assumes,
of course, that the monkey will strike each key with
equal probability. But this only occurs for thought-experiment
monkeys. Real monkeys typing on real typewriters would
not strike each key with equal probability, due to the spatial
arrangement of the keys, physiological constraints, etc.
What this means is that the output from a real typing
monkey will be compressible. Let's suppose it's only compressible
by 10% or so. Given this, and a sequence that's more than
one or two hundred characters in length, I could prove
mathematically that it is virtually impossible for the
monkey to have typed what it just typed. Note that this
has nothing to do with the usual tactic of computing the
probability of that *specific* sequence. It would be virtually
impossible for the monkey to have typed *any* sequence that
happens to be 10% compressible.
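
The arithmetic behind that claim is just counting descriptions, and it
is short enough to sketch. The 27-key alphabet and the 10% figure are
the assumptions above:

import math

alphabet_size = 27                  # assumed: 26 letters plus space
n = 200                             # sequence length in characters
bits_per_char = math.log2(alphabet_size)

total_bits = n * bits_per_char      # bits needed with no compression at all
compressed_bits = 0.9 * total_bits  # what "10% compressible" means here

# There are fewer than 2**(compressed_bits + 1) descriptions that short, but
# 2**total_bits equally likely sequences IF each key were struck with equal
# probability. So the fraction of sequences that are 10% compressible is at most:
log2_fraction = (compressed_bits + 1) - total_bits
print(log2_fraction)                # about -94, i.e. a fraction of roughly 2**-94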

How can this be? Well, in my calculations I would conveniently
forget to mention that I assumed the letters occur with
equal probability. If I were to consider the actual probability
distribution rather than the assumed one, I would find the
monkey's masterpiece to be "typical" for that distribution.
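
Numerically, the resolution looks like this: score the monkey's output
by its "surprise" (negative log probability) under each distribution.
The skewed key probabilities below are made up purely for illustration:

import math, random

keys = list("abcdefghijklmnopqrstuvwxyz ")
# Made-up bias: home-row keys and the space bar are three times easier to hit
weights = [3.0 if k in "asdfghjkl " else 1.0 for k in keys]
p = {k: w / sum(weights) for k, w in zip(keys, weights)}

n = 200
output = ''.join(random.choices(keys, weights=weights, k=n))

bits_uniform = n * math.log2(len(keys))              # surprise if keys were equiprobable
bits_actual = -sum(math.log2(p[c]) for c in output)  # surprise under the real distribution
entropy = -sum(q * math.log2(q) for q in p.values())

print(bits_uniform)   # about 951 bits: what an equal-probability model charges any 200-char output
print(bits_actual)    # noticeably less: this output is far more probable under the real distribution
print(n * entropy)    # about 908 bits: bits_actual lands near here, i.e. the output is "typical"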

This is just an illustration of how deceptive probability calculations
can be. If you don't know the probability distribution, then
don't bother computing any probabilities.

Typing in the above refreshed my memory a bit. I'm afraid I
botched what I wrote previously. The primary statistical
features of a language (such as the frequency at which characters
occur) would not have to be "hardwired" in, as the compression
algorithm would discover these on its own. One of the first
things it will do is look at the frequency at which the various
characters appear. If they occur at anything other than equal
frequencies, then the sequence can be compressed. Please don't
ask me how it works!
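
The frequency-counting step is easy to sketch, though. The textbook
scheme is Huffman coding: tally how often each character occurs in the
input itself, then give shorter codes to the more frequent characters.
A minimal sketch (generic Huffman, not any particular real compression
program):

import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Build a Huffman code from the text's own character counts and
    return the code length, in bits, assigned to each character."""
    counts = Counter(text)
    # Heap entries: (subtree count, tie-breaker, {char: depth so far})
    heap = [(c, i, {ch: 0}) for i, (ch, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, i2, d2 = heapq.heappop(heap)
        merged = {ch: depth + 1 for ch, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, i2, merged))
    return heap[0][2]

text = "methinks it is like a weasel"
lengths = huffman_code_lengths(text)
counts = Counter(text)
total = sum(counts[ch] * lengths[ch] for ch in counts)
print(sorted(lengths.items(), key=lambda kv: kv[1]))  # frequent characters get the short codes
print(total, "bits, versus", 5 * len(text), "bits for a fixed 5-bit code")

Notice that nothing about English is built in: the code is derived
entirely from the counts in whatever string you hand it.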

>One other thing that was pointed out to me privately: technically, when we
>speak of information in a sequence we are speaking of information density,
>not the quantity of information. I will continue in my bad habit of using
>the term 'information' because, unfortunately, all the players in this field
>tend to use this terminology.
>

Hmm... I always tend to think in terms of algorithmic information,
since I am convinced that this is a really great measure of
complexity. I'm guessing here, but I suspect that information
density is a concept from Shannon information theory, i.e.,
bits per symbol. Did I guess right?
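
If that guess is right, then the quantity in question is the
(zeroth-order) Shannon entropy of the symbol frequencies, in bits per
symbol. A sketch, using single-character frequencies only:

import math
from collections import Counter

def bits_per_symbol(text):
    """Zeroth-order Shannon entropy of the text's character distribution."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(bits_per_symbol("methinks it is like a weasel"))  # about 3.4 bits/symbol
print(math.log2(27))                                    # about 4.75 bits/symbol if equiprobable

Multiplying bits per symbol by the number of symbols would then give a
total quantity of information rather than a density.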

Brian Harper
Associate Professor
Applied Mechanics
The Ohio State University

"... we have learned from much experience that all
philosophical intuitions about what nature is going
to do fail." -- Richard Feynman