RE: Spellbound? (was Re: Cobb County)

From: Hon Wai Lai <honwai@bumble.u-net.com>
Date: Sat Jan 29 2005 - 18:24:09 EST

Claims of hidden code in the Bible are classic examples of data mining
and the blatant abuse of statistical techniques. There is a well-written
article on data mining by Ronald Kahn:
 
http://www.barra.com/newsletter/nl165/biblenl165.aspx
 
  

The Bible Code

by Michael Drosnin (New York: Simon & Schuster, 1997)

Reviewed by Ronald N. Kahn
<http://www.barra.com/newsletter/NL160/NLbios.asp#Kahn>

"For three thousand years a code in the Bible has remained hidden. Now
it has been unlocked by computer-and it may reveal our future."

So begins the jacket copy for The Bible Code by Michael Drosnin, a book
receiving considerable public attention, though not much in the
financial press. And yet we couldn't resist reviewing it because its
methodologies are surprisingly similar to the worst data mining excesses
of investment research. This issue's lead article on data mining
discusses Norman Bloom, arguably the world's greatest data miner. He
tried to prove the existence of God through baseball statistics and the
Dow Jones average. Now Mr. Drosnin, armed with the Bible and a computer,
has taken up the cause.

The idea that the Bible contains encoded information has been around for
quite some time. But in 1994, in Statistical Science, three
statisticians reported their analysis of equidistant letter sequences
(ELS) in the book of Genesis. An ELS is a fairly simple type of code.
For example, a particular ten-letter word may begin with the 3,057th
letter and continue with the 3,067th letter, 3,077th letter,..., and the
3,147th letter.
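
An ELS search is simple enough to code in a few lines. Here is a minimal
Python sketch (the function name and the toy string are my own, purely
for illustration, not anything from the paper):

```python
def find_els(text, word, max_skip=50):
    """Search for `word` as an equidistant letter sequence in `text`.

    Returns (start, skip) pairs: the word's letters appear at
    positions start, start+skip, start+2*skip, ...
    """
    # Strip everything but letters, as ELS searches ignore spacing.
    text = "".join(c for c in text.lower() if c.isalpha())
    hits = []
    for skip in range(1, max_skip + 1):
        for start in range(len(text) - (len(word) - 1) * skip):
            if all(text[start + i * skip] == ch
                   for i, ch in enumerate(word)):
                hits.append((start, skip))
    return hits

# Toy example: "code" hidden with skip 2 in a made-up string
print(find_els("xcxoxdxex", "code"))  # → [(1, 2)]
```

Run against a long enough text, a search like this will turn up almost
any short word at some skip, which is exactly the point of the review.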

Of course, words will appear encoded in the Bible just by random chance.
So Doron Witztum, Eliyahu Rips, and Yoav Rosenberg devised a statistical
test of whether Genesis contains any meaningful information. They
assumed that if meaningfully related words appeared encoded "near" each
other, that would imply meaningfully encoded information. So while the
word "hammer" might appear at random and the word "anvil" might appear
at random, these connected words wouldn't appear near each other unless
the text contained meaningful encoded information. With that assertion,
they developed a highly convoluted measure of the "closeness" of the
encoded appearance of any two given words, chose a list of (according to
them) meaningfully related word pairs (names and dates for a list of
famous rabbis), and finally, analyzed whether those word pairs appeared
closer than expected by random chance in Genesis. According to this
test, the probability that random data would generate encoded word pairs
as close as they observed was only 16 out of 1 million.

Starting from this academic paper, author Michael Drosnin applied his
computer to the entire Bible, without regard to any statistical
principles. Searching now for individual words of interest, he then
looked for other suggestive words nearby, backwards or forwards, after
applying liberal interpretive skills. The result in his case is a book
full of remarkable coincidences, completely lacking any statistical
analysis of significance.

For this review, let's consider the original statistical analysis and
the popular book separately. The popular book is simply a fantastic
example of data mining run amok. Had Drosnin not found one
coincidence, he would have found another. If one interpretation of a
word didn't fit, he used another. His quest would have been equally
successful applied to War and Peace, Men are from Mars, Women are from
Venus, or even an old Sears catalog. Proust's insight (see quotation in
"Data Mining" article below) clearly applies here.

The original statistical paper does include an analysis of significance.
So my criticism here is more technical. The authors' definition of
closeness is so contorted as to defy intuition, and it may be very
sensitive to just a small number of very close observations. Another
similar analysis, by Dror Bar-Natan, Alec Gindis, Aryeh Levitan, and
Brendan McKay of the Australian National University, found no unusual
closeness for the same famous rabbis and their most famous books. And
finally, the occasional appearance of encoded word pairs near each other
is simply a far cry from finding or proving (let alone decoding) any
meaningful information encoded in the Bible.

For investment researchers, The Bible Code is just a wonderful example
of the seductive appeal of random patterns found in large data sets. The
book, if not also the paper, ignores all four guidelines discussed on
page 29 of this issue: intuition, restraint, sensibility, and
out-of-sample testing. Researchers-investment and biblical-ignore these
at their own peril.

Data Mining is Easy

Seven Quantitative Insights into Active Management-Part 5

by Ronald N. Kahn
<http://www.barra.com/newsletter/NL160/NLbios.asp#Kahn>

Why is it that so many strategies look great in backtests and disappoint
upon implementation? Backtesters always have 95% confidence in their
results, so why are investors disappointed far more than 5% of the time?
It turns out to be surprisingly easy to search through historical data
and find patterns that don't really exist.

To understand why data mining is easy, we must first understand the
statistics of coincidence. Let's begin with some non-investment
examples. Then we will move on to investment research.

The statistics of coincidence

Several years ago Evelyn Adams won the New Jersey state lottery twice in
four months. Newspapers put the odds of that happening at 17 trillion to
1, an incredibly improbable event. A few months later, two Harvard
statisticians, Persi Diaconis and Frederick Mosteller, showed that a
double win in the lottery is not a particularly improbable event. They
estimated the odds at 30 to 1. What explains the enormous discrepancy in
these two probabilities?

It turns out that the odds of Evelyn Adams winning the lottery twice are
in fact 17 trillion to 1. But that result is presumably of interest only
to her immediate family. The odds of someone, somewhere, winning two
lotteries-given the millions of people entering lotteries every day-are
only 30 to 1. If it wasn't Evelyn Adams, it could have been someone
else.
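
The narrow-versus-broad gap is easy to reproduce numerically. The Python
sketch below uses made-up round numbers (the win probability, ticket
count, and player count are my illustrative assumptions, not Diaconis
and Mosteller's actual inputs), but it shows how many orders of
magnitude separate the two perspectives:

```python
import math

# Illustrative assumptions only:
p = 1 / 5_000_000      # chance one ticket wins one drawing
tickets = 1_000        # tickets a devoted player buys over the years
players = 10_000_000   # such devoted players nationwide

# Narrow view: one named person wins two specific drawings.
narrow = p * p

# Broad view: anyone, anywhere wins at least twice.
# Poisson approximation: a player's win count has mean lam.
lam = tickets * p
p_double_one = 1 - math.exp(-lam) * (1 + lam)   # P(>= 2 wins)
broad = 1 - (1 - p_double_one) ** players

print(f"narrow: about 1 in {1 / narrow:,.0f}")
print(f"broad:  about 1 in {1 / broad:,.1f}")
```

With these assumed figures the narrow odds come out in the trillions
while the broad odds are in single digits, which is the whole story in
miniature.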

Coincidences appear improbable only when viewed from a narrow
perspective. When viewed from the correct (broad) perspective,
coincidences are no longer so improbable. Let's consider another
non-investment example: Norman Bloom, arguably the world's greatest data
miner.

Norman died a few years ago in the midst of his quest to prove the
existence of God through baseball statistics and the Dow Jones average.
He argued that "BOTH INSTRUMENTS are in effect GREAT LABORATORY
EXPERIMENTS wherein GREAT AMOUNTS OF RECORDED DATA ARE COLLECTED, AND
PUBLISHED" (capitalization Bloom's). As but one example of thousands of
his analyses of baseball, he argued that the fact that George Brett, the
Kansas City third baseman, hit his third home run in the third game of
the playoffs, to tie the score 3-3, could not be a coincidence-it must
prove the existence of God. In the investment arena, he argued that the
Dow's 13 crossings of the 1,000 line in 1976 mirrored the 13 colonies
which united in 1776-which also could not be a coincidence. (He pointed
out, too, that the 12th crossing occurred on his birthday, deftly
combining message and messenger.) He never took into account the
enormous volume of data-in fact, an entire New York Public Library's
worth-he searched through to find these coincidences. His focus was
narrow, not broad.

With Norman's passing, the title of world's greatest living data miner
has been left open. Recently, however, Michael Drosnin, author of The
Bible Code, seems to have filled it. (For details, see the book review
<http://www.barra.com/newsletter/nl165/BiBleNl165.asp> .)

The importance of perspective to understanding the statistics of
coincidence was perhaps best summarized by, of all people, Marcel
Proust-who often showed keen mathematical intuition:

        The number of pawns on the human chessboard being less than the
number of combinations that they are capable of forming, in a theater
from which all the people we know and might have expected to find are
absent, there turns up one whom we never imagined that we should see
again and who appears so opportunely that the coincidence seems to us
providential, although, no doubt, some other coincidence would have
occurred in its stead had we not been in that place but in some other,
where other desires would have been born and another old acquaintance
forthcoming to help us satisfy them. (The Guermantes Way, Cities of the
Plain, Volume 2 of translation of Marcel Proust's Remembrance of Things
Past [New York: Vintage Books, 1982], p. 178.)

Investment research

Investment research involves exactly the same statistics and the same
issues of perspective. The typical investment data mining example
involves t-statistics gathered from backtesting strategies. The narrow
perspective says: "After 19 false starts, this 20th investment strategy
finally works. It has a t-statistic of 2."

But the broad perspective on this situation is quite different. In fact,
given 20 informationless strategies, the probability of finding at least
one with a t-statistic of 2 is 64%. The narrow perspective substantially
inflates our confidence in the results. When viewed from the proper
perspective, confidence in the results lowers accordingly.
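
The 64% figure follows from one line of arithmetic. Taking |t| >= 2 as
roughly the usual 5% two-sided significance level, the chance that at
least one of 20 informationless strategies clears the bar is:

```python
# An informationless strategy clears |t| >= 2 (roughly the 5%,
# two-sided level) about 5% of the time by luck alone.
p_false = 0.05
n_strategies = 20

# Broad view: chance that at least one of the 20 clears the bar.
p_at_least_one = 1 - (1 - p_false) ** n_strategies
print(round(p_at_least_one, 2))  # → 0.64
```

So the backtester's "95% confidence" in the 20th strategy is really
closer to a coin flip weighted toward a false positive.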

Four guidelines for backtesting integrity

Given that data mining is easy, how can we safeguard against it? Here
are four guidelines for data mining integrity:

* Intuition
* Restraint
* Sensibility
* Out-of-sample testing

The intuition guideline demands that researchers investigate only those
strategies with some ex ante expectation of success. Investment research
should never involve free-ranging searches for patterns without regard
for intuition.

The restraint guideline attempts to minimize the number of strategies
investigated-i.e., to keep the broad and narrow focus similar. In the
best case, researchers decide ex ante exactly which strategies and
variants they will investigate, run their tests, and look at the
answers. They do not go back and continually refine their
investigations.

The sensibility guideline deletes results that seem improbably
successful. Observed t-statistics that are too large may signal database
errors or an improper methodology rather than a new strategy.

The fourth guideline, out-of-sample testing, is the statistician's
answer to the curse of data mining. Coincidences observed over one data
set are quite unlikely to reoccur in another independent data set.
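
A small simulation makes the point concrete. The sketch below (my own
toy setup: pure-noise daily returns, arbitrary parameters) picks the
best of 20 informationless strategies in-sample, then scores it on a
held-out period:

```python
import random

random.seed(0)

# 20 informationless "strategies": daily returns are pure noise.
def noise_returns(n_days):
    return [random.gauss(0, 0.01) for _ in range(n_days)]

strategies = [noise_returns(500) for _ in range(20)]

def mean(xs):
    return sum(xs) / len(xs)

# In-sample: pick the best performer over the first 250 days.
best = max(strategies, key=lambda r: mean(r[:250]))

print("in-sample mean: ", mean(best[:250]))
print("out-of-sample:  ", mean(best[250:]))
# The in-sample winner's apparent edge typically shrinks toward
# zero out of sample: it was selected for luck, not skill.
```

The in-sample winner looks impressive precisely because it was chosen
for looking impressive; the out-of-sample period has no such selection
bias, which is why the coincidence rarely repeats.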

Conclusions

Many backtesting results are not foolproof demonstrations of strategy
value but merely coincidence. Four backtesting guidelines can help avoid
data mining.

-----Original Message-----
From: asa-owner@lists.calvin.edu [mailto:asa-owner@lists.calvin.edu] On
Behalf Of Vernon Jenkins
Sent: 29 January 2005 21:39
To: bivalve; ASA
Subject: Re: Spellbound? (was Re: Cobb County)

David,
 
You said (28 Jan), " ...Genesis 1...makes it clear that methodological
naturalism is highly appropriate for addressing penultimate or
antepenultimate origins...". And as Christopher wrote earlier (23 Jan),
" The simple reason why the _supernatural_ must never be allowed 'a foot
in the door' is that it cannot be tested, you cannot get a handle on it,
and it is just another God-of-the-gaps argument."
 
My case is simply this: the numero-geometrical features of the Bible's
opening Hebrew words provide the _test-case_ that Christopher dismisses;
for unless a reasonable _naturalistic_ explanation can be found for
these phenomena, the principle of MN is permanently undermined. In other
words, the findings of science - particularly as they offer a direct
challenge to scriptural revelation - may well be invalid because of
failures to take account of the possibility of supernatural
interference.
 
By the way, I can't agree with Don who wrote (28 Jan), " We have no
proof - and are unlikely ever to have proof - that these complexities
actually required special divine input. (What form would such proof
take, anyway?)." Whilst the overall probability of the many rare and
unique features that we find crammed into these 7 opening words may be
difficult to quantify, the impression created is that this must be
_vanishingly small_. [And, let us observe that such situations have not
deterred investigators from making far-reaching claims in other fields!]

 
It is worth spending a little time contemplating this wonder, for it
comprises a multi-stage development. When the words were first recorded,
they possessed a relevant literal meaning - nothing more. Some centuries
later - following the adoption by the Jews of a system of alphabetic
numeration - the letters and words acquired, in addition, the status of
numbers - and it is at this point that most of the geometrical and other
features of interest _became available_ for inspection (but spotted by
none, apparently!). The discovery of the universal constant 'e' by
Euler, the adoption of the Metric System (both in the 18th century) and
the creation of a standard for cut paper sizes in the 1960s,
consolidated the Genesis 1:1 edifice as it is now known.
 
I have written of this as a 'standing miracle' - something that will
forever remind those of us who seek and value truth - of our Creator's
Being and Sovereignty, and of His Grace in providing those who love Him
with such firm assurances in these testing days.
 
Vernon
www.otherbiblecode.com
   
 
 
----- Original Message -----
From: "bivalve" < <mailto:bivalve@mail.davidson.alumlink.com>
bivalve@mail.davidson.alumlink.com>
To: "ASA" < <mailto:asa@calvin.edu> asa@calvin.edu>
Sent: Friday, January 28, 2005 10:26 PM
Subject: Re: Spellbound? (was Re: Cobb County)

>> (3) You have completely ignored the second matter I raised, viz. the
>> lessons that might be learned from the widespread negative reaction to
>> news of the numero-geometrical features of Genesis 1:1. Wouldn't you
>> agree that these phenomena strongly challenge the view that
>> _methodological naturalism_ is the only valid basis for the proper
>> investigation of ultimate origins? It would be good and proper if you
>> were to consider joining me in disabusing others of this significant
>> error.
>
> I don't think the numero-geometrical features address whether
methodological naturalism is relevant to ultimate origins.
>
> The text of Genesis 1:1 makes it clear that methodological naturalism
will be unable to address ultimate origins. However, Genesis 1 also
makes it clear that methodological naturalism is highly appropriate for
addressing penultimate or antepenultimate origins:
>
> Everything is created by an orderly God.
> There aren't any rogue powers, quarreling deities, or other factors
that might disrupt the regular workings of creation, unlike pagan views.
> We were made to rule over creation. In order to do so well, as good
stewards, we need to be able to determine how creation works.
>
> Thus, there are good reasons to expect a study of the ordinary
workings of the universe to be very productive and informative.
>
> Dr. David Campbell
> Old Seashells
> University of Alabama
> Biodiversity & Systematics
> Dept. Biological Sciences
> Box 870345
> Tuscaloosa, AL 35487-0345 USA
> <mailto:bivalve@mail.davidson.alumlink.com>
bivalve@mail.davidson.alumlink.com
>
> That is Uncle Joe, taken in the masonic regalia of a Grand Exalted
Periwinkle of the Mystic Order of Whelks-P.G. Wodehouse, Romance at
Droitgate Spa
>
>
Received on Sat Jan 29 18:25:32 2005

This archive was generated by hypermail 2.1.8 : Sat Jan 29 2005 - 18:25:35 EST