Re: Spellbound? (was Re: Cobb County)

From: Vernon Jenkins <vernon.jenkins@virgin.net>
Date: Sun Jan 30 2005 - 17:46:35 EST

Hon Wai Lai,

My work has nothing to do with ELS, or with Michael Drosnin. Why not examine the evidence for yourself?

Vernon
www.otherbiblecode.com

  ----- Original Message -----
  From: Hon Wai Lai
  To: 'ASA'
  Sent: Saturday, January 29, 2005 11:24 PM
  Subject: RE: Spellbound? (was Re: Cobb County)

  Claims of hidden codes in the Bible are classic examples of data mining and blatant abuse of statistical techniques. There is a well-written article on data mining by Ronald Kahn:

  http://www.barra.com/newsletter/nl165/biblenl165.aspx

    
  The Bible Code
  by Michael Drosnin (New York: Simon & Schuster, 1997)

  Reviewed by Ronald N. Kahn

  "For three thousand years a code in the Bible has remained hidden. Now it has been unlocked by computer-and it may reveal our future."

  So begins the jacket copy for The Bible Code by Michael Drosnin, a book receiving considerable public attention, though not much in the financial press. And yet we couldn't resist reviewing it because its methodologies are surprisingly similar to the worst data mining excesses of investment research. This issue's lead article on data mining discusses Norman Bloom, arguably the world's greatest data miner. He tried to prove the existence of God through baseball statistics and the Dow Jones average. Now Mr. Drosnin, armed with the Bible and a computer, has taken up the cause.

  The idea that the Bible contains encoded information has been around for quite some time. But in 1994, in Statistical Science, three statisticians reported their analysis of equidistant letter sequences (ELS) in the book of Genesis. An ELS is a fairly simple type of code. For example, a particular ten-letter word may begin with the 3,057th letter and continue with the 3,067th letter, 3,077th letter,..., and the 3,147th letter.
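  The ELS idea is mechanical enough to sketch in a few lines of code. The following is an illustrative brute-force search (the sample string and skip limit are invented for the example), not the apparatus used in the Statistical Science paper:

```python
# Illustrative equidistant letter sequence (ELS) search: for each skip d,
# check every start position for word[0], word[0+d], word[0+2d], ...
# The sample text below is contrived to contain one hit.

def find_els(text, word, max_skip=50):
    """Return (start, skip) pairs where `word` appears as an ELS in `text`."""
    text = "".join(text.split())  # drop whitespace so skips count letters only
    hits = []
    n, k = len(text), len(word)
    for skip in range(1, max_skip + 1):
        for start in range(max(0, n - (k - 1) * skip)):
            if all(text[start + i * skip] == word[i] for i in range(k)):
                hits.append((start, skip))
    return hits

sample = "xhxexlxlxo"              # "hello" encoded with a skip of 2
print(find_els(sample, "hello"))   # → [(1, 2)]
```

  As the review notes, finding such hits is the easy part; short words appear as ELSs in any sufficiently long text purely by chance.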

  Of course, words will appear encoded in the Bible just by random chance. So Doron Witztum, Eliyahu Rips, and Yoav Rosenberg devised a statistical test of whether Genesis contains any meaningful information. They assumed that if meaningfully related words appeared encoded "near" each other, that would imply meaningfully encoded information. So while the word "hammer" might appear at random and the word "anvil" might appear at random, these connected words wouldn't appear near each other unless the text contained meaningful encoded information. With that assertion, they developed a highly convoluted measure of the "closeness" of the encoded appearance of any two given words, chose a list of (according to them) meaningfully related word pairs (names and dates for a list of famous rabbis), and finally, analyzed whether those word pairs appeared closer than expected by random chance in Genesis. According to this test, the probability that random data would generate encoded word pairs as close as they observed was only 16 out of 1 million.
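  The logic of testing observed closeness against chance can be sketched with a toy permutation test. The distance measure and position lists below are deliberately simplified stand-ins; WRR's actual closeness metric is, as noted, far more convoluted:

```python
# Toy permutation test: is the observed pairing of word positions closer
# than random re-pairings would be? (Simplified stand-in for the WRR test;
# all numbers below are made up for illustration.)
import random

def permutation_p_value(pos_a, pos_b, trials=10000, seed=1):
    """Fraction of random re-pairings of pos_b whose mean distance to
    pos_a is at least as small as the observed pairing's."""
    observed = sum(abs(a - b) for a, b in zip(pos_a, pos_b)) / len(pos_a)
    rng = random.Random(seed)
    shuffled = list(pos_b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        null_stat = sum(abs(a - b) for a, b in zip(pos_a, shuffled)) / len(pos_a)
        if null_stat <= observed:
            hits += 1
    return hits / trials

# Word-pair positions that happen to sit close together (made-up numbers).
p = permutation_p_value([10, 200, 3000, 4500], [12, 205, 2990, 4510])
print(p)  # small: few random re-pairings are as close as the observed one
```

  The criticisms that follow apply to what feeds such a test, not its mechanics: the choice of distance measure and of the word list can dominate the result.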

  Starting from this academic paper, author Michael Drosnin applied his computer to the entire Bible, without regard to any statistical principles. Searching now for individual words of interest, he then looked for other suggestive words nearby, backwards or forwards, after applying liberal interpretive skills. The result in his case is a book full of remarkable coincidences, completely lacking any statistical analysis of significance.

  For this review, let's consider the original statistical analysis and the popular book separately. The popular book is simply a fantastic example of data mining run amok. If Drosnin didn't find this coincidence, he would find another. If one interpretation of the word didn't fit, he used another. His quest would have been equally successful applied to War and Peace, Men are from Mars, Women are from Venus, or even an old Sears catalog. Proust's insight (see quotation in "Data Mining" article below) clearly applies here.

  The original statistical paper does include an analysis of significance. So my criticism here is more technical. The author's definition of closeness is so contorted as to defy much intuition, but it may be very sensitive to just a small number of very close observations. Another similar analysis, by Dror Bar-Natan, Alec Gindis, Aryeh Levitan, and Brendan McKay of the Australian National University, found no unusual closeness for the same famous rabbis and their most famous books. And finally, the occasional appearance of encoded word pairs near each other is simply a far cry from finding or proving (let alone decoding) any meaningful information encoded in the Bible.

  For investment researchers, The Bible Code is just a wonderful example of the seductive appeal of random patterns found in large data sets. The book, if not also the paper, ignores all four guidelines discussed on page 29 of this issue: intuition, restraint, sensibility, and out-of-sample testing. Researchers-investment and biblical-ignore these at their own peril.

  Data Mining is Easy
  Seven Quantitative Insights into Active Management-Part 5

  by Ronald N. Kahn

  Why is it that so many strategies look great in backtests and disappoint upon implementation? Backtesters always have 95% confidence in their results, so why are investors disappointed far more than 5% of the time? It turns out to be surprisingly easy to search through historical data and find patterns that don't really exist.

  To understand why data mining is easy, we must first understand the statistics of coincidence. Let's begin with some non-investment examples. Then we will move on to investment research.

  The statistics of coincidence

  Several years ago Evelyn Adams won the New Jersey state lottery twice in four months. Newspapers put the odds of that happening at 17 trillion to 1, an incredibly improbable event. A few months later, two Harvard statisticians, Persi Diaconis and Frederick Mosteller, showed that a double win in the lottery is not a particularly improbable event. They estimated the odds at 30 to 1. What explains the enormous discrepancy in these two probabilities?

  It turns out that the odds of Evelyn Adams winning the lottery twice are in fact 17 trillion to 1. But that result is presumably of interest only to her immediate family. The odds of someone, somewhere, winning two lotteries-given the millions of people entering lotteries every day-are only 30 to 1. If it wasn't Evelyn Adams, it could have been someone else.
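  The arithmetic behind the narrow-versus-broad distinction is simple compounding. The figures below (single-person odds of 1 in 17 trillion, roughly half a trillion player-opportunities) are illustrative assumptions chosen to land near the 30-to-1 ballpark, not Diaconis and Mosteller's actual calculation:

```python
# Narrow vs. broad perspective on the double lottery win.
# p_double: chance of one named person winning twice (per the news report).
# n_chances: an assumed, illustrative number of player-opportunities.
import math

p_double = 1 / 17e12
n_chances = 5e11

# P(at least one double winner somewhere) = 1 - (1 - p)^n,
# computed with log1p/expm1 for numerical accuracy at tiny p.
prob = -math.expm1(n_chances * math.log1p(-p_double))
print(f"someone, somewhere: about 1 in {1 / prob:.0f}")
```

  The narrow event stays fantastically unlikely; the broad event, aggregated over everyone playing, is merely uncommon.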

  Coincidences appear improbable only when viewed from a narrow perspective. When viewed from the correct (broad) perspective, coincidences are no longer so improbable. Let's consider another non-investment example: Norman Bloom, arguably the world's greatest data miner.

  Norman died a few years ago in the midst of his quest to prove the existence of God through baseball statistics and the Dow Jones average. He argued that "BOTH INSTRUMENTS are in effect GREAT LABORATORY EXPERIMENTS wherein GREAT AMOUNTS OF RECORDED DATA ARE COLLECTED, AND PUBLISHED" (capitalization Bloom's). As but one example of thousands of his analyses of baseball, he argued that the fact that George Brett, the Kansas City third baseman, hit his third home run in the third game of the playoffs, to tie the score 3-3, could not be a coincidence-it must prove the existence of God. In the investment arena, he argued that the Dow's 13 crossings of the 1,000 line in 1976 mirrored the 13 colonies which united in 1776-which also could not be a coincidence. (He pointed out, too, that the 12th crossing occurred on his birthday, deftly combining message and messenger.) He never took into account the enormous volume of data-in fact, an entire New York Public Library's worth-he searched through to find these coincidences. His focus was narrow, not broad.

  With Norman's passing, the title of world's greatest living data miner has been left open. Recently, however, Michael Drosnin, author of The Bible Code, seems to have filled it. (For details, see the book review.)

  The importance of perspective to understanding the statistics of coincidence was perhaps best summarized by, of all people, Marcel Proust-who often showed keen mathematical intuition:

    The number of pawns on the human chessboard being less than the number of combinations that they are capable of forming, in a theater from which all the people we know and might have expected to find are absent, there turns up one whom we never imagined that we should see again and who appears so opportunely that the coincidence seems to us providential, although, no doubt, some other coincidence would have occurred in its stead had we not been in that place but in some other, where other desires would have been born and another old acquaintance forthcoming to help us satisfy them. (The Guermantes Way, Cities of the Plain, Volume 2 of translation of Marcel Proust's Remembrance of Things Past [New York: Vintage Books, 1982], p. 178.)

  Investment research

  Investment research involves exactly the same statistics and the same issues of perspective. The typical investment data mining example involves t-statistics gathered from backtesting strategies. The narrow perspective says: "After 19 false starts, this 20th investment strategy finally works. It has a t-statistic of 2."

  But the broad perspective on this situation is quite different. In fact, given 20 informationless strategies, the probability of finding at least one with a t-statistic of 2 is 64%. The narrow perspective substantially inflates our confidence in the results. When viewed from the proper perspective, confidence in the results lowers accordingly.
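  The 64% figure follows directly from treating each informationless backtest as an independent 5% chance of clearing |t| >= 2:

```python
# Probability that at least one of 20 informationless strategies shows
# |t| >= 2, assuming each has an independent 5% chance of doing so.
p_single = 0.05
n_strategies = 20

p_at_least_one = 1 - (1 - p_single) ** n_strategies
print(round(p_at_least_one, 2))  # → 0.64
```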

  Four guidelines for backtesting integrity

  Given that data mining is easy, how can we safeguard against it? Here are four guidelines for data mining integrity:

  - Intuition
  - Restraint
  - Sensibility
  - Out-of-sample testing

  The intuition guideline demands that researchers investigate only those strategies with some ex ante expectation of success. Investment research should never involve free-ranging searches for patterns without regard for intuition.

  The restraint guideline attempts to minimize the number of strategies investigated-i.e., to keep the broad and narrow focus similar. In the best case, researchers decide ex ante exactly which strategies and variants they will investigate, run their tests, and look at the answers. They do not go back and continually refine their investigations.

  The sensibility guideline deletes results that seem improbably successful. Observed t-statistics that are too large may signal database errors or an improper methodology rather than a new strategy.

  The fourth guideline, out-of-sample testing, is the statistician's answer to the curse of data mining. Coincidences observed over one data set are quite unlikely to reoccur in another independent data set.
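  A quick simulation makes the point concrete. Here twenty "strategies" are pure noise with no information at all; the best in-sample performer is selected and then re-measured on fresh data (all numbers are synthetic, and the seed is arbitrary):

```python
# Out-of-sample testing on informationless "strategies": daily returns are
# pure Gaussian noise, so the best backtest result is a selection artifact
# that should revert toward zero on independent data.
import random

def mean(xs):
    return sum(xs) / len(xs)

rng = random.Random(42)
n_strategies, n_days = 20, 250

# Independent noise for the backtest period and a later, held-out period.
in_sample = [[rng.gauss(0, 1) for _ in range(n_days)] for _ in range(n_strategies)]
out_sample = [[rng.gauss(0, 1) for _ in range(n_days)] for _ in range(n_strategies)]

best = max(range(n_strategies), key=lambda i: mean(in_sample[i]))
print(f"best in-sample mean:         {mean(in_sample[best]):+.3f}")
print(f"same strategy, out-of-sample: {mean(out_sample[best]):+.3f}")
```

  The winner of the in-sample horse race looks positive by construction; its out-of-sample mean is just another draw from the same zero-mean noise.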

  Conclusions

  Many backtesting results are not foolproof demonstrations of strategy value but merely coincidence. Four backtesting guidelines can help avoid data mining.

    -----Original Message-----
    From: asa-owner@lists.calvin.edu [mailto:asa-owner@lists.calvin.edu] On Behalf Of Vernon Jenkins
    Sent: 29 January 2005 21:39
    To: bivalve; ASA
    Subject: Re: Spellbound? (was Re: Cobb County)

    David,

    You said (28 Jan), " ...Genesis 1...makes it clear that methodological naturalism is highly appropriate for addressing penultimate or antepenultimate origins...". And as Christopher wrote earlier (23 Jan), " The simple reason why the _supernatural_ must never be allowed 'a foot in the door' is that it cannot be tested, you cannot get a handle on it, and it is just another God-of-the-gaps argument."

    My case is simply this: the numero-geometrical features of the Bible's opening Hebrew words provide the _test-case_ that Christopher dismisses; for unless a reasonable _naturalistic_ explanation can be found for these phenomena, the principle of MN is permanently undermined. In other words, the findings of science - particularly as they offer a direct challenge to scriptural revelation - may well be invalid because of failures to take account of the possibility of supernatural interference.

    By the way, I can't agree with Don who wrote (28 Jan), " We have no proof - and are unlikely ever to have proof - that these complexities actually required special divine input. (What form would such proof take, anyway?)." Whilst the overall probability of the many rare and unique features that we find crammed into these 7 opening words may be difficult to quantify, the impression created is that this must be _vanishingly small_. [And, let us observe that such situations have not deterred investigators from making far-reaching claims in other fields!]

    It is worth spending a little time contemplating this wonder, for it comprises a multi-stage development. When the words were first recorded, they possessed a relevant literal meaning - nothing more. Some centuries later - following the adoption by the Jews of a system of alphabetic numeration - the letters and words acquired, in addition, the status of numbers - and it is at this point that most of the geometrical and other features of interest _became available_ for inspection (but spotted by none, apparently!). The discovery of the universal constant 'e' by Euler, the adoption of the Metric System (both in the 18th century) and the creation of a standard for cut paper sizes in the 1960s, consolidated the Genesis 1:1 edifice as it is now known.

    I have written of this as a 'standing miracle' - something that will forever remind those of us who seek and value truth - of our Creator's Being and Sovereignty, and of His Grace in providing those who love Him with such firm assurances in these testing days.

    Vernon
    www.otherbiblecode.com
       

    ----- Original Message -----
    From: "bivalve" <bivalve@mail.davidson.alumlink.com>
    To: "ASA" <asa@calvin.edu>
    Sent: Friday, January 28, 2005 10:26 PM
    Subject: Re: Spellbound? (was Re: Cobb County)

> > (3) You have completely ignored the second matter I raised, viz the lessons that might be learned from the widespread negative reaction to news of the numero-geometrical features of Genesis 1:1. Wouldn't you agree that these phenomena strongly challenge the view that _methodological naturalism_ is the only valid basis for the proper investigation of ultimate origins? It would be good and proper if you were to consider joining me in disabusing others of this significant error.
>
> I don't think the numero-geometrical features address whether methodological naturalism is relevant to ultimate origins.
>
> The text of Genesis 1:1 makes it clear that methodological naturalism will be unable to address ultimate origins. However, Genesis 1 also makes it clear that methodological naturalism is highly appropriate for addressing penultimate or antepenultimate origins:
>
> Everything is created by an orderly God.
> There aren't any rogue powers, quarreling deities, or other factors that might disrupt the regular workings of creation, unlike pagan views.
> We were made to rule over creation. In order to do so well, as good stewards, we need to be able to determine how creation works.
>
> Thus, there are good reasons to expect a study of the ordinary workings of the universe to be very productive and informative.
>
> Dr. David Campbell
> Old Seashells
> University of Alabama
> Biodiversity & Systematics
> Dept. Biological Sciences
> Box 870345
> Tuscaloosa, AL 35487-0345 USA
> bivalve@mail.davidson.alumlink.com
>
> That is Uncle Joe, taken in the masonic regalia of a Grand Exalted Periwinkle of the Mystic Order of Whelks-P.G. Wodehouse, Romance at Droitgate Spa
>
>
Received on Sun Jan 30 17:47:23 2005

This archive was generated by hypermail 2.1.8 : Sun Jan 30 2005 - 17:47:25 EST