On-Board Maintenance for Affordable, Evolvable and Dependable Spaceborne Systems

This effort focuses on onboard guarded software upgrading (GSU), an important aspect of onboard maintenance for long-life missions. GSU aims to avoid or minimize the loss or degradation of mission performance caused by software upgrading activities during a mission, or by system failures due to residual faults in an upgraded version. Through onboard validation and guarded operation, GSU permits an upgraded software component to begin serving the mission seamlessly. If the upgraded component proves insufficiently reliable and thus poses an unacceptable risk to the mission, GSU ensures that the system is safely downgraded by replacing the upgraded component with an earlier version.
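The guarded-operation idea can be illustrated with a minimal sketch. It assumes a single component whose outputs can be checked by an acceptance test, and it downgrades to the earlier version once a failure budget is exhausted. The names `GuardedComponent`, `accept`, and `max_failures` are hypothetical and chosen only for illustration; they do not come from the GSU design itself.

```python
from typing import Callable


class GuardedComponent:
    """Runs an upgraded version under guard, keeping an earlier version as fallback.

    Illustrative only: the acceptance test and failure budget are assumptions.
    """

    def __init__(self,
                 old: Callable[[float], float],
                 new: Callable[[float], float],
                 accept: Callable[[float, float], bool],
                 max_failures: int = 1) -> None:
        self.old = old                  # earlier, trusted version
        self.new = new                  # upgraded version under validation
        self.accept = accept            # acceptance test on (input, output)
        self.max_failures = max_failures
        self.failures = 0
        self.downgraded = False

    def step(self, x: float) -> float:
        if not self.downgraded:
            try:
                y = self.new(x)
                if self.accept(x, y):
                    return y            # upgraded output passed validation
            except Exception:
                pass                    # a crash counts as a failed validation
            self.failures += 1
            if self.failures >= self.max_failures:
                self.downgraded = True  # stop using the upgraded version
        return self.old(x)              # serve the request with the earlier version


# Usage: the "upgrade" mishandles negative inputs, so it is replaced after one failure.
guarded = GuardedComponent(old=abs, new=lambda x: x, accept=lambda x, y: y >= 0)
results = [guarded.step(3.0), guarded.step(-2.0), guarded.step(5.0)]
```

In this sketch every request is still answered, by the earlier version if need be, which mirrors the goal of avoiding mission performance loss while the upgraded component is being validated.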

Guarded Software Upgrading

We take a crucial step in devising error containment and recovery methods by introducing the "confidence-driven" notion. This notion complements the "communication-induced" approach employed by a number of checkpointing protocols for tolerating hardware faults. The resulting error containment and recovery protocol is thus both message-driven and confidence-driven (MDCD). In particular, the MDCD protocol is based on a two-tiered approach. First, we discriminate among software components with respect to our confidence in them. Second, during onboard execution, we adjust our confidence in the processes created from those software components according to our knowledge of potential process state contamination caused by errors in a low-confidence component and spread by message passing.
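The two tiers can be sketched as follows. This is a simplified illustration, not the MDCD algorithm itself: it assumes one static `low_confidence` attribute per process (first tier), one run-time `suspect` flag (second tier), and a rule that a trusted process checkpoints its state just before accepting a message from a sender that is not trusted. All names are hypothetical.

```python
import copy
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Process:
    name: str
    low_confidence: bool = False        # tier 1: confidence in the software component
    suspect: bool = False               # tier 2: state may be contaminated at run time
    state: dict = field(default_factory=dict)
    checkpoint: Optional[dict] = None

    @property
    def trusted(self) -> bool:
        return not self.low_confidence and not self.suspect


def deliver(sender: Process, receiver: Process, key: str, value: int) -> None:
    """Pass a message without restriction, but record possible contamination."""
    if not sender.trusted and receiver.trusted:
        # Save the last state known to be uncontaminated, then lower confidence.
        receiver.checkpoint = copy.deepcopy(receiver.state)
        receiver.suspect = True
    receiver.state[key] = value         # the message is always delivered


# Usage: contamination may spread from the upgraded process to P2 and on to P3.
p1_new = Process("P1_new", low_confidence=True)
p2 = Process("P2", state={"a": 1})
p3 = Process("P3")
deliver(p1_new, p2, "b", 2)             # P2 checkpoints {"a": 1}, becomes suspect
deliver(p2, p3, "c", 3)                 # P3 checkpoints {}, becomes suspect
```

The checkpoint is driven both by the message (it is taken on receipt) and by confidence (it is taken only when the receiver's confidence is about to drop), which is the sense in which the sketch is message-driven and confidence-driven.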

The MDCD nature of our approach distinguishes it significantly from traditional software fault tolerance techniques. Most importantly, rather than controlling and mediating the information flow to prevent erroneous information from affecting a component, the MDCD approach allows interacting processes to communicate without restriction while keeping track of potential error contamination so that recovery actions remain possible. Accordingly, we perform correctness validation only at the system boundary, and use the validation result both to adjust our confidence in individual processes and to trigger message-driven, confidence-driven checkpoint establishment. Furthermore, we exploit the multiple software versions that are inherently available to us, together with non-dedicated hardware redundancy, which keeps both development and performance costs low.
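Validation at the system boundary might look like the sketch below. It rests on simplifying assumptions that are not taken from the protocol specification: only messages leaving the system are tested, a passed test restores confidence in the sending process alone, and a failed test rolls every potentially contaminated process back to its saved checkpoint. `TrackedProcess`, `send_external`, and `acceptance_test` are hypothetical names.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class TrackedProcess:
    state: dict = field(default_factory=dict)
    suspect: bool = False               # state may be contaminated
    checkpoint: Optional[dict] = None   # last state saved while still trusted


def send_external(procs: List[TrackedProcess],
                  sender: TrackedProcess,
                  value: int,
                  acceptance_test: Callable[[int], bool]) -> bool:
    """Validate a message at the system boundary and adjust confidence."""
    if acceptance_test(value):
        # Assumption: a validated output vouches for the sender's current state.
        sender.suspect = False
        sender.checkpoint = None
        return True                     # message may leave the system
    for p in procs:
        if p.suspect and p.checkpoint is not None:
            p.state = copy.deepcopy(p.checkpoint)   # roll back to the saved state
        p.suspect = False
    return False                        # message is suppressed


# Usage: a failed test at the boundary triggers rollback of the suspect process.
a = TrackedProcess(state={"v": 1, "bad": 9}, suspect=True, checkpoint={"v": 1})
b = TrackedProcess(state={"w": 2})
released = send_external([a, b], a, -5, lambda v: v >= 0)
```

Because internal messages are never tested in this sketch, the cost of validation is paid only where an error could actually reach the outside world, in line with the low-overhead aim stated above.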

Factors other than upgrading, such as complexity, testability, and test coverage, may also lead us to discriminate among interacting software components in a distributed system with respect to our confidence in their trustworthiness. Those factors suggest that the MDCD approach can serve as a general-purpose, low-cost software fault tolerance technique for distributed embedded computing. Accordingly, we have extended the algorithm so that checkpoints can be established on the basis of fine-grained confidence adjustment, which enables the MDCD protocol to serve a general class of distributed embedded systems. Moreover, we have extended the algorithms so that the MDCD protocol can coordinate synergistically with an existing time-based checkpointing protocol, allowing software and hardware faults to be tolerated simultaneously.

This generalization and extension of the algorithms make the GSU methodology feasible not only for NASA's long-life missions but also for commercial applications that undergo online software upgrading and require high availability and/or safety, such as transportation systems, airline reservation systems, telephone systems, and financial services.