Saturday, July 17, 1999

Component Software book ideas

I see that there are some ideas in my Fault Expectant Programming framework that are similar to ideas in "Component Software" by Clemens Szyperski.

On page 307, he talks about "atomic actions" that are invoked indirectly, and even queued up for execution.  This rhymes with my atomic transactions that are managed by a coordinator.

On page 46, there is a section titled "Specifying Time and Space Requirements", which rhymes with my notion of contracts carrying enough performance detail to let would-be workers bid on requests.
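
To make that last point concrete for myself, here is a minimal Java sketch of a contract carrying explicit time and space terms so that would-be workers can bid on a request. All the names here (PerformanceContract, Bid, award) are placeholders of my own, not anything from the book.

    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    /**
     * Sketch: a contract that carries enough time/space detail for would-be
     * workers to bid on a request. Names are illustrative placeholders only.
     */
    public class ContractBiddingSketch {

        /** The performance terms a requestor attaches to a request. */
        record PerformanceContract(long maxMillis, long maxBytes) {}

        /** A would-be worker's offer to fulfill a request under a given contract. */
        record Bid(String workerId, long estMillis, long estBytes, double cost) {
            /** A bid is admissible only if it fits inside the contract's limits. */
            boolean satisfies(PerformanceContract c) {
                return estMillis <= c.maxMillis() && estBytes <= c.maxBytes();
            }
        }

        /** Pick the cheapest bid that meets the contract's time and space bounds. */
        static Optional<Bid> award(PerformanceContract contract, List<Bid> bids) {
            return bids.stream()
                       .filter(b -> b.satisfies(contract))
                       .min(Comparator.comparingDouble(Bid::cost));
        }

        public static void main(String[] args) {
            PerformanceContract contract = new PerformanceContract(200, 1 << 20);
            List<Bid> bids = List.of(
                    new Bid("w1", 150, 512 * 1024, 3.0),
                    new Bid("w2", 300, 256 * 1024, 1.0),  // too slow: violates maxMillis
                    new Bid("w3", 100, 128 * 1024, 2.0));
            System.out.println(award(contract, bids));    // cheapest admissible bid: w3
        }
    }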

Wednesday, July 7, 1999

Catalog of Fault Mechanisms

In the previous blog entry, What Could Fault-Expectant Programming Mean?, I outlined a framework abstracted out of recurring themes in a series of error detection, recovery, and avoidance mechanisms that I had compiled.  Here follows that list of mechanisms (in no particular order).
  • Distributed Systems: remove single point of failure by using multiple components
  • Loose Coupling: keeps failures from spreading via ripple effects or brittleness
  • Ack/Nak: verify results and, on failure, repeat request (e.g. protocols)
  • Fail Over: verify results and, on failure, repeat the request but to a different component (see the sketch after this list)
  • Do Over: set aside failed requests or out-of-bound results and retry them later
  • Auto Restart: on failure of a component, it should auto-reset/restart and continue
  • Leases: dead-man switch on any allocated resource of a component (including the attention of its partners a la protocol timeout)
  • Event Driven: handle non-deterministic order of returned results from components
  • Fault Tolerance: ignore failed requests or out-of-bound results and continue rather than generating errors/exceptions
  • Backtracking: processing that expects to hit dead-ends and so backtracks to try other approaches  (e.g. Prolog logic rules, parsers that employ backtracking algorithms)
  • Redundant Components: issue parallel requests to multiple components and take a majority-rules vote on the result (but must know if results are deterministic or not; i.e. if more than one result can be valid then different components may return different, but still valid, results)
  • Evolutionary Programming: issue parallel requests to multiple components and result is taken from the most fit component i.e. chosen by the quality of the result rather than the result itself
  • Auctions: issue parallel requests to multiple components that compete on cost of service as a definition of most fit
  • Neural Nets: issue parallel requests to multiple components and take a vote biased by each component's dynamically adjusted success rating (i.e. each component votes its stock).
  • Fuzzy Logic: result returned by component is biased by a probability/quality rating
  • Transactions: state transitions are confined to successful atomic steps
  • Exceptions: asynchronous notification of and response to problems, with the ability to either continue or abort the current operation.
  • Game Theory: multiple conflicting rule sets are projected via min-max trees to find a balanced result
  • Blackboard Systems: multiple workers on a problem each process as much as they are able and share a common result state, i.e. individual workers are not expected to produce a complete result, or even any result at all (e.g. JavaSpaces)
  • Workflow Models: combination of state-transition models and PERT dependency models to keep track of progress at a global level given multiple parallel workers and to reset to some given state(s) if results are not converging
  • Belt & Suspenders: multiple independent methods of verifying results or progress
  • Mobile Agents: workers dynamically move to more appropriate environments e.g. load balancing, fail-over, seek more reliable communications, etc.
  • Design by Contract: assertions to detect failure on the part of either the requestor or the worker e.g. Cleanroom techniques, Component Interfaces, Strong Types, policy driven security managers, etc.
  • Pattern Matching: non-trivial matching of requestor interfaces with worker interfaces to allow more flexible and dynamic establishment of contracts e.g. KQML, JATLite
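
To ground a few of these in code, here is a rough Java sketch that combines Ack/Nak, Fail Over, and a lease-style timeout: each result is verified, and on a bad result, an exception, or an expired deadline the request is repeated, first to the same worker and then to the next one. The Worker interface and the retry policy are assumptions of mine for illustration, not any particular library.

    import java.util.List;
    import java.util.Optional;
    import java.util.concurrent.*;
    import java.util.function.Predicate;

    /**
     * Sketch of Ack/Nak plus Fail Over with a lease-style deadline. The Worker
     * interface and the retry policy are illustrative assumptions only.
     */
    public class RetryFailOverSketch {

        interface Worker<Q, R> {
            R handle(Q request) throws Exception;
        }

        static <Q, R> Optional<R> request(Q req, List<Worker<Q, R>> workers,
                                          Predicate<R> acceptable,
                                          long leaseMillis, int attemptsPerWorker) {
            ExecutorService pool = Executors.newCachedThreadPool();
            try {
                for (Worker<Q, R> worker : workers) {              // Fail Over: next component
                    for (int i = 0; i < attemptsPerWorker; i++) {  // Ack/Nak: repeat the request
                        Future<R> future = pool.submit(() -> worker.handle(req));
                        try {
                            R result = future.get(leaseMillis, TimeUnit.MILLISECONDS);
                            if (acceptable.test(result)) {
                                return Optional.of(result);        // "Ack": verified result
                            }                                      // "Nak": out-of-bounds result, retry
                        } catch (TimeoutException e) {
                            future.cancel(true);                   // lease expired: reclaim the worker
                        } catch (ExecutionException e) {
                            // worker threw: treat the same as a Nak and retry / fail over
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();    // we were interrupted: give up
                            return Optional.empty();
                        }
                    }
                }
                return Optional.empty();                           // no worker met the goal
            } finally {
                pool.shutdownNow();
            }
        }
    }

The same skeleton would extend to Do Over by appending failed requests to a queue for a later pass instead of giving up.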

What Could Fault-Expectant Programming Mean?

"Fault-Expectant" is a new term I coined for an alternative to "fault tolerant". It was inspired by reading the transcript of the JAVA LANGUAGE FUTURES session at the 1999 JavaOne conference, where there was a discussion of building robust large scale systems by mimicking biological systems' strategy of actually expecting all their parts to fail. I realized that fault tolerant (aka High Availability) systems still basically expect things to go right, but provide ways to recover from exceptional failures. A completely different mind set is needed though to act as if failure is so normal that faults are expected, and expected everywhere.

Part of the JavaOne discussion centered on the difficulty of building large systems, given the inability of programmers to generate perfect code above a certain size. The heart of the problem seemed to be our general expectation to the contrary. Unlike biological systems that expect to deal with constant subsystem failure, we basically expect the operations that we invoke to work. We don't, as a rule, build in fine-grained mechanisms to carry on in the event of faults. It reminded me of a networking company I worked with that had built error detection into their link layer protocol, but into none of the higher layers. When I asked why, they said that since there was guaranteed message delivery from machine to machine, there was no need to recheck it at the end-to-end level. When I asked what would happen if a message were lost inside a machine, they replied, "that would be a bug that we would have to fix". It didn't occur to them that protocol stacks could be built to succeed in spite of "bugs" at many levels.

So, what would "fault expectant" mean? After cataloging all the error detection, handling, & prevention techniques I could think of, and taking a step back, I saw that fault-expectant must mean some amount of non-determinism. When I ask somebody to do something, I must expect that it will fail, and automatically ask again (whether of the same somebody, or someone else). But this "try again" strategy can't be shallow. It can't merely be like a protocol that re-requests a packet because it got a corrupted one. It has to be more like a game-playing program that looks ahead several promising moves, but backtracks multiple moves when it finds itself painted into a corner; or like a database transaction that not only rolls back changes if an error occurs before commit time, but actually tries a different algorithm on the re-attempt.
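
Here is a minimal sketch of that last idea in Java, assuming a hypothetical Transaction with commit/rollback and a list of interchangeable algorithms ordered from preferred to fallback; none of these names come from a real library.

    import java.util.List;
    import java.util.Optional;
    import java.util.function.Supplier;

    /**
     * Sketch: an attempt that not only rolls back on failure but retries with
     * a different algorithm. Transaction and Algorithm are hypothetical
     * stand-ins; real code might wrap JDBC or JTA behind Transaction.
     */
    public class AlternativeAlgorithmSketch {

        interface Transaction {
            void commit();
            void rollback();
        }

        interface Algorithm<I, O> {
            O run(Transaction tx, I input) throws Exception;
        }

        /** Try each algorithm inside its own transaction; fall through on failure. */
        static <I, O> Optional<O> attempt(Supplier<Transaction> transactions,
                                          List<Algorithm<I, O>> algorithms, I input) {
            for (Algorithm<I, O> algorithm : algorithms) {
                Transaction tx = transactions.get();
                try {
                    O result = algorithm.run(tx, input);
                    tx.commit();                   // only a successful atomic step changes state
                    return Optional.of(result);
                } catch (Exception e) {
                    tx.rollback();                 // undo everything, then try the next approach
                }
            }
            return Optional.empty();               // every approach hit a dead end
        }
    }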

Non-deterministic systems are basically trying to optimize a solution using some (hopefully) converging algorithm. These systems have a long heritage in the Artificial Intelligence literature. There are many algorithms that attempt solutions using components (e.g. heuristic or logic rules) that may or may not succeed, and so rolling back and trying a different path is a normal, not exceptional, mode of operation. Prolog is an example of a computer language with built-in goal-seeking behavior and backtracking after "failures". I put failures in quotes because they are more like SQL queries that return zero results: not really failures, they just didn't find the info we were looking for.
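
In Java rather than Prolog, a toy version of that backtracking style might look like the following; the rule/state representation is a simplification of my own.

    import java.util.List;
    import java.util.Optional;
    import java.util.function.Function;
    import java.util.function.Predicate;

    /**
     * Toy depth-first goal search with backtracking. Hitting a dead end is not
     * an error; the search simply returns empty and the caller tries the next
     * branch. The rule/state representation is a simplification of my own.
     */
    public class BacktrackingSketch {

        /** A rule proposes zero or more next states; zero means "this branch failed". */
        interface Rule<S> extends Function<S, List<S>> {}

        static <S> Optional<S> solve(S state, Predicate<S> goal, List<Rule<S>> rules, int depth) {
            if (goal.test(state)) return Optional.of(state);
            if (depth == 0) return Optional.empty();         // out of budget: not an exception
            for (Rule<S> rule : rules) {
                for (S next : rule.apply(state)) {           // try each candidate branch
                    Optional<S> found = solve(next, goal, rules, depth - 1);
                    if (found.isPresent()) return found;     // success propagates upward
                }                                            // otherwise: backtrack, keep looking
            }
            return Optional.empty();                         // like a query that returns zero rows
        }

        public static void main(String[] args) {
            // Trivial demo: reach 10 from 0 using +3 and +7 steps, within depth 5.
            Rule<Integer> addThree = n -> List.of(n + 3);
            Rule<Integer> addSeven = n -> List.of(n + 7);
            System.out.println(solve(0, n -> n == 10, List.of(addThree, addSeven), 5));
        }
    }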

A corollary to this approach is that "instructions" must be at such a high level of abstraction that they are effectively declarative (again like SQL). That is, I need to specify what I want, and not how to do it. Specify idealized instructions, then expect & allow imperfect executions of those instructions. The instruction follower interprets the instructions in view of its own circumstances and its own bag of tricks. The book "The Meme Machine" (pg x – xii) talks about successful memes having idealized instructions that avoid any details which may fail. As the meme is perpetuated, each individual attempts to carry out those instructions and may do so imperfectly, but the essence of the idea carries on.
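
As a tiny Java illustration of the "what, not how" half of this (all names are placeholders of my own): the request below states only the ordering the result must satisfy, and each worker fulfills it with its own bag of tricks.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    /**
     * Sketch of "say what, not how": the request states only the property the
     * result must have (the ordering); each worker picks its own method.
     */
    public class DeclarativeRequestSketch {

        /** The request: items plus the ordering we want, nothing about algorithms. */
        record SortRequest<T>(List<T> items, Comparator<T> order) {}

        interface SortWorker {
            <T> List<T> fulfill(SortRequest<T> request);
        }

        /** One worker's bag of tricks: lean on the standard library sort. */
        static class LibraryWorker implements SortWorker {
            public <T> List<T> fulfill(SortRequest<T> r) {
                List<T> out = new ArrayList<>(r.items());
                out.sort(r.order());
                return out;
            }
        }

        /** Another worker interprets the same request with a heap instead. */
        static class HeapWorker implements SortWorker {
            public <T> List<T> fulfill(SortRequest<T> r) {
                PriorityQueue<T> heap = new PriorityQueue<>(r.order());
                heap.addAll(r.items());
                List<T> out = new ArrayList<>();
                while (!heap.isEmpty()) out.add(heap.poll());
                return out;
            }
        }
    }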

A Conceptual Framework

After reviewing my catalog of error mechanisms (enumerated in a separate blog entry), I came up with the following framework abstracted from a few recurring ideas. Rather than using terms heavy with connotations (e.g. client, server, message), I have used the following more neutral terms: requestor, request, result, worker, contract, match, & goal.
  • The Request: A requestor, seeking a goal, makes requests to a set of workers where zero or more workers may return results. There must be a decision as to which, if any, of the results are acceptable. If the goal has not been met, there must be a decision as to how to proceed. In the event of detection of worker failure (e.g. lease expiration, unacceptable result, etc.), a decision must be made as to how to proceed (e.g. Ack/Nak, Tolerance, Do-Over, Fail-Over, etc.). If more than one result is acceptable, a decision must be made as to which to choose (e.g. majority-rules, most-fit, voting-stock, etc.). The most trivial case of this entire scenario is the statically compiled function call, followed by the increasingly less trivial scenarios: method calls, RPCs, ORB remote methods, web service requests, etc.
  • The Negotiation: A requestor and all responding workers must have a common contract between them that specifies the accepted protocol for each request. There must be a decision made as to which workers are a match for the proffered contract (e.g. compatible interfaces, domain expertise, etc.). There must be a decision made as to which compatible workers are actually chosen to be engaged (e.g. lowest cost).
  • The Transaction: The act of matching requestor and worker(s) for a particular request, engaging the worker(s), judging the result(s), and concluding or aborting the request. It is considered an atomic transaction that commits or rolls back system state.
  • The Goal: The act of "engaging a set of workers that are returning (possibly partial) results" is decoupled from the act of "determining whether the overall goal of the request has been met". This means that there may be a communal data space (e.g. JavaSpaces, blackboard systems, etc.) where (possibly partial) results are deposited and acted upon, rather than results being returned directly via any individual worker's response. It also means that a response carrying a fuzziness factor may or may not meet a goal's quality requirements; if it does not, work on the original request continues.
  • The Coordinator: The determination of which transaction(s) [and hence request(s)] are generated next is made by a workflow coordinator. It may use a static state transition scheme (i.e. simple control flow), or it may use any number of dynamic search/optimization schemes (e.g. min-max trees, backtracking, etc.). Its task is to monitor that progress is being made towards its goal, and to reset as needed when progress has ceased. All state transitions are transactions. The relationship between coordinators and workers is recursive, i.e. the internal logic of a worker is handled by its own coordinator. Multiple independent coordinators may monitor the same workflow (i.e. belt & suspenders). The coordinator(s) ensure that failed requestors/workers are restarted or disposed of (aka garbage collection). (A minimal sketch of these players in code follows Fig 1 below.)

    Fig 1. The transaction players and their shared data space context
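
To make the roles above (and in Fig 1) a bit more tangible, here is a minimal Java sketch of the players as interfaces. Every name and signature is a guess of my own at one possible shape, meant to make the vocabulary concrete rather than to pin down a design.

    import java.util.List;
    import java.util.Optional;

    /**
     * A minimal sketch of the framework's players as Java interfaces. All
     * names and signatures are illustrative guesses, not a worked-out design.
     */
    public class FaultExpectantSketch {

        /** The contract: what is wanted (declaratively), plus the terms a worker must meet. */
        interface Contract<Q, R> {
            boolean matches(Worker<Q, R> worker);      // pattern matching / negotiation
            boolean acceptable(R result);              // is this individual result good enough?
            boolean goalMet(ResultSpace<R> space);     // is the overall goal met, judged on the shared space?
        }

        /** A worker may return a partial result, no result, or fail outright. */
        interface Worker<Q, R> {
            Optional<R> attempt(Q request) throws Exception;
        }

        /** Communal data space where (possibly partial) results are deposited. */
        interface ResultSpace<R> {
            void deposit(R partialResult);
            List<R> snapshot();
        }

        /** The outcome of one atomic transaction. */
        enum Outcome { COMMITTED, ROLLED_BACK }

        /** One atomic transaction: match, engage, judge, then commit or roll back. */
        interface Transaction<Q, R> {
            Outcome run(Contract<Q, R> contract, List<Worker<Q, R>> candidates,
                        Q request, ResultSpace<R> space);
        }

        /** The coordinator decides which transaction comes next and watches for progress. */
        interface Coordinator<Q, R> {
            void pursueGoal(Contract<Q, R> goal, List<Worker<Q, R>> workers, ResultSpace<R> space);
        }
    }

A concrete Coordinator would put one of the schemes from the catalog entry (simple state transitions, min-max search, backtracking) behind pursueGoal, and something like a JavaSpaces tuple space could stand in for ResultSpace.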