"Fault-Expectant" is a new term I coined for an alternative to "fault tolerant". It was inspired by reading the transcript of the JAVA LANGUAGE FUTURES session at the 1999 JavaOne conference, where there was a discussion of building robust large scale systems by mimicking biological systems' strategy of actually expecting all their parts to fail. I realized that fault tolerant (aka High Availability) systems still basically expect things to go right, but provide ways to recover from exceptional failures. A completely different mind set is needed though to act as if failure is so normal that faults are expected, and expected everywhere.
A part of the JavaOne discussion centered around the difficulty of building large systems because of the inability of programmers to generate perfect code above a certain size. The heart of the problem seemed to be our general expectation to the contrary. Unlike biological systems that expect to deal with constant subsystem failure, we basically expect the operations that we invoke to work. We don't, as a rule, build in fine-grained mechanisms to carry on in the event of faults. It reminded me of a networking company I worked with that had built error detection into their link layer protocol, but no higher layers. When I asked why, they said that since there was guaranteed message delivery from machine-to-machine there was no need to recheck it at the end-to-end level. When I asked, what if there was message loss inside a machine, they replied, "that would be a bug that we would have to fix". It didn't occur to them that protocol stacks could be built to succeed in spite of "bugs" at many levels.
So, what would "fault expectant" mean? After cataloging all the error detection, handling, & prevention techniques I could think of, and taking a step back, I saw that fault-expectant must mean some amount of non-determinism. When I ask somebody to do something, I must expect that it will fail, and automatically ask again (whether the same somebody, or someone else). But this "try again" strategy can't be shallow. It can't merely be like a protocol that re-requests a packet because it got a corrupted one. It has to be more like a game playing program that looks ahead several promising moves, but backtracks multiple moves when it finds itself painted into a corner. Like a database transaction that, not only rolls back changes if an error occurs before commit time, but actually tries a different algorithm on the re-attempt.
Non-deterministic systems are basically trying to optimize a solution using some (hopefully) converging algorithm. These systems have a long heritage in Artificial Intelligence literature. There are many algorithms that attempt solutions using components (e.g. heuristic or logic rules) that may or may not succeed, and so rolling back and trying a different path is a normal and not exceptional mode of operation. Prolog is an example of a computer language having built-in goal-seeking behavior, and backtracking after "failures". I put failures in quotes because they are more like SQL queries that return zero results; not really a failure, just didn't get the info we were looking for.
A corollary to this approach is that "instructions" must be at such a high level of abstraction, that they are effectively declarative (again like SQL). I.E. I need to specify what I want, and not how to do it. Specify idealized instructions then expect & allow imperfect executions of those instructions. The instruction follower interprets the instructions in view of its own circumstances and its own bag of tricks. The book "Meme Machine" (pg x – xii) talks about successful memes having idealized instructions that avoid any details which may fail. As the meme is perpetuated, each individual attempts to carry out those instructions and may do so imperfectly, but the essence of the idea carries on.
A Conceptual Framework
After reviewing my catalog of error mechanisms (enumerated in a separate blog entry), I came up with the following framework abstracted from a few recurring ideas. Rather than using terms heavy with connotations (e.g. client, server, message), I have used the following more neutral terms: requestor, request, result, worker, contract, match, & goal.