Wednesday, July 7, 1999

What Could Fault-Expectant Programming Mean?

"Fault-Expectant" is a new term I coined for an alternative to "fault tolerant". It was inspired by reading the transcript of the JAVA LANGUAGE FUTURES session at the 1999 JavaOne conference, where there was a discussion of building robust large scale systems by mimicking biological systems' strategy of actually expecting all their parts to fail. I realized that fault tolerant (aka High Availability) systems still basically expect things to go right, but provide ways to recover from exceptional failures. A completely different mind set is needed though to act as if failure is so normal that faults are expected, and expected everywhere.

A part of the JavaOne discussion centered around the difficulty of building large systems because of the inability of programmers to generate perfect code above a certain size. The heart of the problem seemed to be our general expectation to the contrary. Unlike biological systems that expect to deal with constant subsystem failure, we basically expect the operations that we invoke to work. We don't, as a rule, build in fine-grained mechanisms to carry on in the event of faults. It reminded me of a networking company I worked with that had built error detection into their link layer protocol, but no higher layers. When I asked why, they said that since there was guaranteed message delivery from machine-to-machine there was no need to recheck it at the end-to-end level. When I asked, what if there was message loss inside a machine, they replied, "that would be a bug that we would have to fix". It didn't occur to them that protocol stacks could be built to succeed in spite of "bugs" at many levels.

So, what would "fault expectant" mean? After cataloging all the error detection, handling, & prevention techniques I could think of, and taking a step back, I saw that fault-expectant must mean some amount of non-determinism. When I ask somebody to do something, I must expect that it will fail, and automatically ask again (whether the same somebody, or someone else). But this "try again" strategy can't be shallow. It can't merely be like a protocol that re-requests a packet because it got a corrupted one. It has to be more like a game playing program that looks ahead several promising moves, but backtracks multiple moves when it finds itself painted into a corner. Like a database transaction that, not only rolls back changes if an error occurs before commit time, but actually tries a different algorithm on the re-attempt.

Non-deterministic systems are basically trying to optimize a solution using some (hopefully) converging algorithm. These systems have a long heritage in Artificial Intelligence literature. There are many algorithms that attempt solutions using components (e.g. heuristic or logic rules) that may or may not succeed, and so rolling back and trying a different path is a normal and not exceptional mode of operation. Prolog is an example of a computer language having built-in goal-seeking behavior, and backtracking after "failures". I put failures in quotes because they are more like SQL queries that return zero results; not really a failure, just didn't get the info we were looking for.

A corollary to this approach is that "instructions" must be at such a high level of abstraction, that they are effectively declarative (again like SQL). I.E. I need to specify what I want, and not how to do it. Specify idealized instructions then expect & allow imperfect executions of those instructions. The instruction follower interprets the instructions in view of its own circumstances and its own bag of tricks. The book "Meme Machine" (pg x – xii) talks about successful memes having idealized instructions that avoid any details which may fail. As the meme is perpetuated, each individual attempts to carry out those instructions and may do so imperfectly, but the essence of the idea carries on.

A Conceptual Framework

After reviewing my catalog of error mechanisms (enumerated in a separate blog entry), I came up with the following framework abstracted from a few recurring ideas. Rather than using terms heavy with connotations (e.g. client, server, message), I have used the following more neutral terms: requestor, request, result, worker, contract, match, & goal.
  • The Request: A requestor, seeking a goal, makes requests to a set of workers where zero or more workers may return results. There must be a decision as to which, if any, of the results are acceptable. If the goal has not been met, there must be a decision as to how to proceed. In the event of detection of worker failure (e.g. lease expiration, unacceptable result, etc.), a decision must be made as to how to proceed (e.g. Ack/Nak, Tolerance, Do-Over, Fail-Over, etc.). If more than one result is acceptable, a decision must be made as to which to choose (e.g. majority-rules, most-fit, voting-stock, etc.). The most trivial case of this entire scenario is the statically compiled function call, followed by the increasingly less trivial scenarios: method calls, RPCs, ORB remote methods, web service requests, etc.
  • The Negotiation: A requestor and all responding workers must have a common contract between them that specifies the accepted protocol for each request. There must be a decision made as to which workers are a match for the proffered contract (e.g. compatible interfaces, domain expertise, etc.). There must be a decision made as to which compatible workers are actually chosen to be engaged (e.g. lowest cost).
  • The Transaction: The act of matching requestor and worker(s) for a particular request, engaging the worker(s), judging the result(s), and concluding or aborting the request. It is considered an atomic transaction that commits or rolls back system state.
  • The Goal: The act of "engaging a set of workers that are returning (possibly partial) results" is decoupled from the act of "determining whether the overall goal of the request has been met". This means that there may be a communal data space (e.g. JavaSpaces, blackboard systems, etc.) where (possibly partial) results are deposited and acted upon rather than results being returned directly via any individual worker's response. It also means that responses with a fuzziness factor may or may not meet goals which have quality requirements, and hence, work will continue on the original request.
  • The Coordinator: The determination of which transaction(s) [and hence request(s)] are generated next is made by a workflow coordinator. It may use a static state transition scheme (i.e. simple control flow), or it may use any number of dynamic search/optimization schemes (e.g. min-max trees, backtracking, etc.). Its task is to monitor that progress is being made towards its goal, and reset as needed when progress has ceased. All state transitions are transactions. The relationship between coordinators and workers is recursive, i.e. the internal logic of a worker is handled by its own coordinator. Multiple independent coordinators may monitor the same workflow (i.e. belt & suspenders). The coordinator(s) insure that failed requestors/workers are restarted or disposed of (aka garbage-collection).

    Fig 1. The transaction players and their shared data space context

No comments: