Saturday, December 22, 2007

Clark Kent is (not) Superman

[Ed. Note. This is part one of a series found on "Existential Programming, the blog": "A Rose is a Rose is (not) a Rose"]

It delights me to find out that what I thought had been a particular nugget of wisdom, specific to building Identity matching computer systems, actually has a deep principle at work. While working on one of these systems, I learned the strategy of NOT merging all variations of an individual identity's name/address/phone/etc into a single canonical version. It turns out that the need to keep, and assign a unique key to, every variation of identity data (as opposed to only the "canonical" one) has deep roots in language itself...

While reading the Intellectual Devotional (which I highly recommend), I came across its page about "Philosophy of Language" and it had an immediate resonance with a project at my current client. The page describes the "problem of reference" where ideas about what a name "means" have been debated and changed over time.

One theory says that "names" don't have any meaning in and of themselves; they merely refer to some thing that has meaning. Hence, Shakespeare's quote "A rose by any other name would smell as sweet" summarizes the position that the word "rose" is not meaningful, and could be exchanged with any other word that refers to the thing "rose". That is why "gulaab" (the Urdu word for rose) can work just as well for speakers of Urdu.

Another, more modern theory, though, says that names not only refer to some thing, they also carry a connotation of the sense in which the thing is being referenced. The book illustrates this with the example of Superman and Clark Kent both being names for the same thing (the being Superman), but they are not interchangeable. Clark Kent (mild mannered reporter) has a work address of the Daily Planet whereas Superman (superhero able to leap tall buildings) does not. It matters which name is used when talking about Superman.

So, in the same way that Clark Kent and Superman both refer to different aspects of the same entity, and are thus not interchangeable, a computer system managing legal entity identity data cannot translate name/address variations into a single entity ID# when those variations actually refer to different aspects of the entity. For example, if there is data that is specific to a particular store branch, that branch needs its own well-known ID# even though it is only a portion of a single legal entity. Further, since legal entity names are not unique (I own two different corporations with the identical legal name), the entire name/address/phone/etc. combination needs managing rather than separate "alternate name" lists. It is also not sufficient to support alternate name/address records merely as search aids that still ultimately result in the ID# of the entity-as-a-whole. Otherwise one would lose track of the fact that we were talking about Clark Kent, not Superman.
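
The idea of keying every variation, rather than only the entity-as-a-whole, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the design of the actual system: the names `IdentityVariation`, `IdentityRegistry`, `variation_id`, and `entity_id` are all hypothetical, and a real system would match on far more than an exact name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentityVariation:
    """One name/address/phone combination, with its own key (hypothetical model)."""
    variation_id: int   # unique key for THIS variation
    entity_id: int      # key of the legal entity it is an aspect of
    name: str
    address: str
    phone: str = ""


class IdentityRegistry:
    """Keeps every variation instead of merging them into one canonical record."""

    def __init__(self):
        self._variations = {}
        self._next_id = 1

    def add(self, entity_id, name, address, phone=""):
        variation = IdentityVariation(self._next_id, entity_id, name, address, phone)
        self._variations[variation.variation_id] = variation
        self._next_id += 1
        return variation

    def search(self, name):
        # A search returns the variations themselves, not just the entity ID,
        # so the caller still knows WHICH aspect of the entity was matched.
        return [v for v in self._variations.values() if v.name == name]

    def entity_of(self, variation_id):
        return self._variations[variation_id].entity_id


registry = IdentityRegistry()
clark = registry.add(entity_id=1, name="Clark Kent", address="Daily Planet")
superman = registry.add(entity_id=1, name="Superman", address="")
```

Here both records share `entity_id` 1, yet each keeps its own `variation_id`, so data attached to "Clark Kent" (such as the Daily Planet work address) is never silently attributed to "Superman".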

Friday, November 30, 2007

Odometer Game Redux

Well, after 35 years of pondering what I thought was an abstract mathematical puzzle, my "odometer game" has found a real-world application!

It turns out that my notion of "remarkable" numbers [i.e. numbers that are so remarkable that if the driver saw his odometer sitting on that number he would either honk his horn or point it out to his passengers] is just the ticket for finding "fake" ID numbers.

My current contract at a major bank found me looking for suspect ID numbers, Tax IDs, phone numbers, etc. in various customer databases. The bank employees entering this information would often get around the fact that these fields were "required" by entering syntactically legal digit strings that were nonetheless meaningless. After viewing a few of these it quickly became obvious that they were related to my notion of remarkableness. Actual values found included:
0, 121212121, 000000000, 999999999(9), 111111111, 111111112, 222222221, 888888889, 188888888, 0999999999, 589999999, 255511555 (?)

So, rather than an explicit list of IDs to put on a watch list (as I was asked to find), it became clear that a better answer would have been to use an evaluation function that reported the remarkableness score for each value. A cutoff point could then be established to filter out suspicious values. Alas, while I have casually pondered the mathematics involved in scoring the remarkableness of a number, I've never actually tried to program it. But, now it has become more than an obscure puzzle, and shows signs of having "real world" value!
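
For illustration only, here is a rough sketch of what such an evaluation function might look like. It is not a worked-out theory of remarkableness: the two signals used (longest run of one repeated digit, and shortest repeating period), the way they are combined, and the 0.7 cutoff are all assumptions chosen to fit the sample values above.

```python
def longest_run(digits):
    """Length of the longest run of one repeated digit, e.g. 8 for '188888888'."""
    best = current = 1
    for previous, digit in zip(digits, digits[1:]):
        current = current + 1 if digit == previous else 1
        best = max(best, current)
    return best


def shortest_period(digits):
    """Smallest p such that every digit equals the digit p places before it."""
    n = len(digits)
    for p in range(1, n + 1):
        if all(digits[i] == digits[i - p] for i in range(p, n)):
            return p
    return n


def remarkableness(digits):
    """Score in [0, 1]; higher means more suspiciously 'remarkable' (assumed heuristic)."""
    n = len(digits)
    if n == 0:
        return 0.0
    run_score = longest_run(digits) / n
    period_score = (n - shortest_period(digits)) / n
    return max(run_score, period_score)


def suspects(values, cutoff=0.7):
    """Values whose score reaches the (arbitrary, assumed) cutoff."""
    return [v for v in values if remarkableness(v) >= cutoff]
```

With this heuristic, every sample value listed above reaches the 0.7 cutoff except 255511555, which is also the one marked with a question mark. Ascending runs such as 123456789 would need an additional signal.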

Saturday, October 13, 2007

The Odometer Game

Dear Dr. Douglas Hofstadter,
Having been a fan of yours since GEB (I had you sign my copy in '83 at UC Santa Cruz), I have always wanted to write to you about an "odometer game" I concocted about 1973 which touches upon several of your favorite themes: patterns, their recognition, and "human" vs "machine" intelligence. Following Hofstadter's law, it has taken longer to write you than I ever thought it would. ;-)

The thought-provoking part is imagining how a computer would ever "play" the game. It would involve mathematically defining "remarkable" odometer numbers, where "remarkable" is defined as any number that would intuitively cause the driver to remark to the other passengers: "Hey, look at the odometer!". The more likely a number is to cause a driver to say that, the higher its "remarkableness".

I've casually pondered the math for this for years. Let me know if you have already solved it.
Sincerely,
Bruce Wallace

The Odometer Game

As many people have done over the years, I honked my horn when my odometer rolled over to all zeros [000000] (back when doing so took only 100,000.0 miles). Later, when I put on another 11,111.1 miles [111111], I decided that an odometer reading of all ones was also worthy of a honk (a bit of a geek whimsy).

Since I had many long boring drives between college and parents, I came up with a little diversion which was to honk (or otherwise take note of) any "remarkable" odometer reading [where "remarkable" was any number that would make a driver point it out to the passengers].

I even accumulated imaginary points that mirrored the amount of "remarkableness" of the number. But, soon I realized I needed a reason to keep from simply taking note of EVERY number in my quest to build up my point total. I thought that maybe some function balancing total-points vs average-points-per-honk was needed. And to make things more sporting, I should lose points if I missed any numbers in a "pattern" once I had taken points for that pattern. In other words, if I took points for [000000] and [111111], I would lose points if I missed [222222]. (I.E., don't start a pattern if you aren't going to keep it up!)

So, [000000] was definitely remarkable, and so was [111111], [222222], [333333], etc. (Hmmm... [111111] seems less remarkable than [000000], and [222222] thru [888888] all seem less remarkable than either [000000] or [111111]...should they all get the same points?). Then came [123456]. And while less remarkable, [234567], [345678], etc. all seem pretty good.

Palindromes are very good, but [123321] seems more remarkable than [394493] or [825528]. And [121212] & [123123] are very good, but less so [838383] & [378378]. While [010101] and [999999] beat out [898989] & [888888] respectively, all seem good enough to take the points.

Round numbers like [010000], [020000], [030000], etc. seem nice because the pattern is anchored with [000000]. Actually, a number like [000000] meets lots of patterns at once: [aaaaaa], [ababab], [abccba], [abcabc], etc. (in addition to being the ultimate round number), so, it gets LOTS of points.
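
The letter templates above can be checked mechanically. The sketch below assumes one particular matching rule: each letter must stand for the same digit everywhere it appears, but two different letters are allowed to stand for the same digit (which is what lets [000000] match [ababab]). The names `TEMPLATES`, `matches`, and `matched_templates` are made up for this example.

```python
# The four templates named in the text; a fuller list is left open.
TEMPLATES = ["aaaaaa", "ababab", "abccba", "abcabc"]


def matches(reading, template):
    """True if the reading fits the template, with each letter bound to one digit."""
    if len(reading) != len(template):
        return False
    binding = {}
    for digit, letter in zip(reading, template):
        # setdefault binds the letter on first sight and returns the
        # existing binding afterwards, so a mismatch means a conflict.
        if binding.setdefault(letter, digit) != digit:
            return False
    return True


def matched_templates(reading):
    """All templates that the reading satisfies."""
    return [t for t in TEMPLATES if matches(reading, t)]
```

Under this rule `matched_templates("000000")` returns all four templates while `matched_templates("123321")` returns only `["abccba"]`. Note that [111111] and [555555] also match all four, so counting templates alone does not separate them from [000000].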

The Puzzle

Why are some numbers (i.e. digit strings) instinctively more "remarkable" than others? How would one model this mathematically? Patterns seem part of the answer, but a readily recognizable pattern is in [192837] even though it would seem very unlikely for a driver to make a passenger take note of that number/pattern.
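
To make the [192837] example concrete, here is a small sketch under one assumed reading of its pattern: the digits in alternating positions form two interleaved runs, 1-2-3 going up and 9-8-7 going down. The helper names `constant_step` and `interleaved_steps` are hypothetical.

```python
def constant_step(digits):
    """Common difference if the digits form an arithmetic run, else None."""
    values = [int(d) for d in digits]
    steps = {b - a for a, b in zip(values, values[1:])}
    return steps.pop() if len(steps) == 1 else None


def interleaved_steps(reading):
    """Steps of the two runs found at alternating positions of the reading."""
    return constant_step(reading[0::2]), constant_step(reading[1::2])
```

A program finds this structure easily: `interleaved_steps("192837")` gives `(1, -1)`, even though `constant_step("192837")` is `None`. That a machine can detect the pattern so cheaply while a driver would probably not remark on it is exactly the gap the puzzle is asking about.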

And why are [000000], [111111], & [999999] all more "remarkable" than [222222] thru [888888]? Why is [123456] more than [012345], but [121212] and [010101] seem more of a toss up? Is the "simplicity" of the pattern the crux of "remarkableness"? How would one describe that "simplicity" mathematically (especially when 0 and 1 and 9 seem somehow more "simple" than 2 thru 8)? What grammar "parses" this string language?

Monday, May 21, 2007

Language Plateaus in Evolution?

Reading[1] about the different levels of human language competency that plateau at various ages (6, puberty, etc.) made me wonder whether those capabilities mirror those of our ancestors at various stages of evolution. Just as a human embryo looks like amphibians, etc. as it is developing (in a mirror of DNA development over the ages), maybe language skill levels that jump in quantum leaps mirror primate evolution?

[1] Introducing Chomsky, John Maher, Judy Groves, Icon Books, 1996.