[IPP] Fwd: [info] Software Rejuvenation

Thu Jan 11 21:27:26 UTC 2018

Hi,

Here is the recent paper on Software Rejuvenation from Lawrence Bernstein,
who invented the concept in the 1990s.

Cheers,
- Ira

Ira McDonald (Musician / Software Architect)
Co-Chair - TCG Trusted Mobility Solutions WG
Chair - Linux Foundation Open Printing WG
Secretary - IEEE-ISTO Printer Working Group
Co-Chair - IEEE-ISTO PWG Internet Printing Protocol WG
IETF Designated Expert - IPP & Printer MIB
Blue Roof Music / High North Inc
http://sites.google.com/site/blueroofmusic
http://sites.google.com/site/highnorthinc
mailto: blueroofmusic at gmail.com
Jan-April: 579 Park Place  Saline, MI  48176  734-944-0094
May-Dec: PO Box 221  Grand Marais, MI 49839  906-494-2434

---------- Forwarded message ----------
From: Tatourian, Alan <alan.tatourian at intel.com>
Date: Wed, Aug 9, 2017 at 1:15 AM
Subject: [info] Software Rejuvenation
To:

Software Rejuvenation

Lawrence Bernstein and Dr. Chandra M. R. Kintala

Stevens Institute of Technology

Here is a design approach that makes software more trustworthy, called
software rejuvenation. It is a periodic, pre-emptive restart of a running
system at a clean internal state that prevents latent faults from becoming
future failures. It was used in systems ranging from a Lucent billing unit
to NASA's long-duration space mission to Pluto, and is implemented in IBM's
Netfinity resource manager. It is easy to apply, uses very little central
processing unit time, increases software reliability by two orders of
magnitude, and is recommended for all software-intensive systems.

Software modules comprise a large part of life- and mission-critical
systems. System crashes are more likely to be the result of a fault in the
software than in the hardware. In spite of our best efforts at removing the
errors/faults (bugs) before deploying those systems, it is wise to assume
that bugs remain in the system and those bugs often lead to failures
(crashes).

Software fault tolerance is aimed at tolerating those residual faults by
building mechanisms to watch for failures and recover from them [1, 2].
Fault tolerance is a reactive approach: failures usually happen at
unexpected times, and the built-in recovery mechanisms then kick in to
restart the system and the service. However, these unscheduled
interruptions in service are expensive and can be life-threatening. This
article describes a proactive, preventive technique called software
rejuvenation that prevents faults from becoming failures.

Lawrence Bernstein observed in 1990 that faults/bugs, when triggered in
software, do not always cause failures/crashes immediately but take the
system into a state where it begins to decay. This decay has symptoms of
memory leakage, broken pointers, unreleased file locks, numerical error
accumulation, etc., causing gradual degradation in availability of service
and data quality and eventually leading to a failure/crash.

Based on this observation, a new method to enhance the dependability of a
software system, called software rejuvenation, was introduced in 1995 by
Kintala and his colleagues at Bell Labs [1, 3]. Software rejuvenation is a
proactive approach that involves stopping an executing process periodically
or when a failure is imminent, cleaning up the internal state of the
system, and then restarting it at a known healthy state to prevent a
predicted future failure.
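
The stop, clean up, and restart cycle described above can be sketched in a
few lines of Python. This is only an illustration under stated assumptions,
not the Bell Labs implementation: LeakyService, Rejuvenator, max_age_s, and
max_residue are hypothetical names, and the growing list merely stands in
for decay such as leaked memory or unreleased locks.

```python
import time


class LeakyService:
    """Hypothetical service whose internal state decays as it handles work."""

    def __init__(self):
        self.cache = []  # stands in for leaked memory, stale locks, etc.

    def handle(self, request):
        self.cache.append(request)  # residue left behind by every request
        return f"ok:{request}"


class Rejuvenator:
    """Restart a service from a clean state on a schedule or on a warning sign."""

    def __init__(self, factory, max_age_s, max_residue, clock=time.monotonic):
        self.factory = factory          # builds a fresh, known-healthy instance
        self.max_age_s = max_age_s      # periodic trigger (assumed threshold)
        self.max_residue = max_residue  # "failure is imminent" trigger (assumed)
        self.clock = clock
        self.restarts = 0
        self._start()

    def _start(self):
        self.service = self.factory()
        self.started_at = self.clock()

    def _needs_rejuvenation(self):
        too_old = self.clock() - self.started_at >= self.max_age_s
        decayed = len(self.service.cache) >= self.max_residue
        return too_old or decayed

    def handle(self, request):
        # Check between requests so that no request is interrupted mid-flight.
        if self._needs_rejuvenation():
            self._start()  # drop the old state and start clean
            self.restarts += 1
        return self.service.handle(request)
```

A caller would wrap the service once, for example
Rejuvenator(LeakyService, max_age_s=3600, max_residue=10000), and send all
requests through handle(). In a real system the restart would typically be
a process or node restart rather than a new Python object.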

Software rejuvenation is as intuitive as occasionally rebooting your PC,
except that it was never defined, implemented, modeled, and analyzed for
software systems before 1995 [3]. Shari Pfleeger used the term software
rejuvenation to mean, “…looking back at software work products to try to
derive additional information …” in her seminal software engineering book
[4]. Her use differs from ours as we focus on the execution of the software
during its mission, and she focuses on the software development process.

Modeling and Analysis

Software rejuvenation incurs overhead and should be done at a time when the
cost due to service interruption is minimal. Hence, modeling the system to
find optimal rejuvenation times is crucial. A simple and useful model based
on continuous-time Markov chains was first introduced in [3] to analyze
software rejuvenation.
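
To give a feel for this kind of analysis, here is a minimal Python sketch
of a four-state continuous-time Markov chain. The forwarded text does not
reproduce the model from [3], so the state names, the rate symbols (r2,
lam, r1, r4, r3), and the cost weights below are assumptions chosen for
illustration only.

```python
def steady_state(r2, lam, r1, r4, r3):
    """Long-run fraction of time in each state of a four-state chain.

    healthy --r2--> failure_probable --lam--> failed       --r1--> healthy
                    failure_probable --r4---> rejuvenating --r3--> healthy
    All arguments are rates (events per hour); r4 = 0 means "never rejuvenate".
    """
    # Balance equations, written relative to the failure-probable state.
    healthy = (lam + r4) / r2
    failed = lam / r1
    rejuvenating = r4 / r3
    total = 1.0 + healthy + failed + rejuvenating
    return {
        "healthy": healthy / total,
        "failure_probable": 1.0 / total,
        "failed": failed / total,
        "rejuvenating": rejuvenating / total,
    }


def hourly_cost(r4, cost_failed, cost_rejuv, **rates):
    """Expected cost per hour of unplanned plus planned downtime."""
    p = steady_state(r4=r4, **rates)
    return cost_failed * p["failed"] + cost_rejuv * p["rejuvenating"]
```

Comparing hourly_cost for several values of r4 shows whether rejuvenating
pays off for a given set of rates. In this simple chain the cost is a ratio
of two linear functions of r4, so it only ever rises or falls as r4 grows:
rejuvenation helps when an unplanned outage is costly enough relative to a
planned restart, and otherwise it does not.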

The Future

Software rejuvenation is ready for industry-wide deployment. It can make
software systems more trustworthy. Good designers will use it and move it
from the state of the art to the state of the practice. It is a good design
practice for individual systems.

Software rejuvenation is one aspect of self-healing that has gained
research interest recently. There are some interesting new problems for
software rejuvenation in large-scale, networked, self-healing systems. We
describe some of those problems here and make some suggestions:

1. For networked applications, we need to monitor and gather the
availability and quality of all the required resources for the application
across the network, and then synthesize that gathered data and make a
prediction about possible failure of the application or a component in the
application. Network application monitoring might be hard to do in such a
generalized fashion. You can perhaps do it in a limited domain such as a
Voice over Internet Protocol (VoIP) application in an enterprise network.

2. Self-healing systems on a network need alternate paths for communication
between components to avoid an impending failure. This may be hard to do in
a generalized fashion. But in much the same way as in clustered systems
providing redundancy for centralized applications, you can perhaps provide
alternate communication paths for some self-healing applications (for
example, VoIP) using alternate service provider networks.

3. Modeling and implementation face several problems due to the large-scale
nature of these systems. What is a state in a large-scale system when state
is spread across several products and systems in a network? Perhaps you need
to model the system in a hierarchical, tree-structured fashion, decomposing the
state into smaller units as you need it for analysis. Failure symptoms are
at a system/network (macro) level but rejuvenation actions are at a
component (micro) level; how do you correlate the two? This topic is
perhaps related to event correlation in network management. How do you do
rejuvenation efficiently in very large systems? Perhaps gradual load
shedding can be used. What is a safe (clean internal) state to back up to?
How do you back up to that state?
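
The monitor-and-predict step in item 1 can be illustrated on a single
resource. The sketch below fits a straight line to recent usage samples and
extrapolates to a limit. It is a deliberately simple stand-in, not the
method of any cited paper, and time_to_exhaustion, should_rejuvenate,
limit, and horizon_s are hypothetical names.

```python
def time_to_exhaustion(samples, limit):
    """Estimate seconds until `limit` is reached from (time_s, usage) samples.

    Fits a least-squares line through the samples and extrapolates it.
    Returns None when usage is flat or falling (no exhaustion predicted).
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    last_t = samples[-1][0]
    fitted_now = mean_u + slope * (last_t - mean_t)  # fitted usage at last sample
    return max(0.0, (limit - fitted_now) / slope)


def should_rejuvenate(samples, limit, horizon_s):
    """True when the fitted trend reaches `limit` within `horizon_s` seconds."""
    eta = time_to_exhaustion(samples, limit)
    return eta is not None and eta <= horizon_s
```

In a networked setting one such estimate per resource and per component
would still have to be combined into a single prediction, which is exactly
the open problem raised in items 1 and 3.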

References

1.       Bernstein, L. “Software Fault Tolerance Forestalls Crashes: To Err
Is Human, to Forgive Is Fault Tolerant” in Advances in Computers 58. Highly
Dependable Software. Ed. M. Zelkowitz. Academic Press, 2003: 240-285.

2.       Lyu, M., Ed. Software Fault Tolerance. New York: John Wiley, 1995.

3.       Huang, Y., C. Kintala, N. Kolettis, and N.D. Fulton. “Software
Rejuvenation: Analysis, Module and Applications.” Proc. of 25th Symposium on
Fault Tolerant Computing FTCS-25, Pasadena, CA, June 1995: 381-390
<www.ece.stevens-tech.edu/~ckintala/Papers/RejuvFTCS25.pdf>. The Web site
<www.software-rejuvenation.com>, maintained by Professor Trivedi at Duke
University, has a collection of follow-up research papers on the topic.

4.       Pfleeger, S.L. Software Engineering Theory and Practice. 2nd ed.
Prentice Hall, 2001: 496-502.

5.       Li, L., K. Vaidyanathan, and K.S. Trivedi. “An Approach for
Estimation of Software Aging in a Web Server.” International Symposium on
Empirical Software Engineering, Nara, Japan, Oct. 2002.

6.       Vaidyanathan, K., R.E. Harper, S.W. Hunter, and K.S. Trivedi.
“Analysis and Implementation of Software Rejuvenation in Cluster Systems.”
Proc. of the Joint Intl. Conference on Measurement and Modeling of
Computer Systems, ACM SIGMETRICS 2001/Performance 2001, Cambridge, MA, June
2001.

7.       Tai, A.T., L. Alkalai, and S.N. Chau. “Onboard Preventive
Maintenance: A Design-Oriented Analytic Study for Long-Life Applications.”
Performance Evaluation 35.3-4 (June 1999): 215-232.

8.       Bernstein, L., Y.D. Yao, and K. Yao. “Software Rejuvenation:
Avoiding Failures Even When There Are Faults.” The DoD SoftwareTECH News
6.2 (Oct. 2003): 8-11 <www.softwaretechnews.com>.

9.       General Accounting Office. “B-247094, Report to the House of
Representatives.” Washington, D.C.: GAO, Information Management and
Technology Division, 4 Feb. 1992 <www.fas.org/spp/starwars/gao/im92026.htm>.

10.      Bao, Y., X. Sun, and K. Trivedi. “Adaptive Software Rejuvenation:
Degradation Models and Rejuvenation Schemes.” Proc. of The International
Conference on Dependable Systems and Networks, San Francisco, CA, June 2003.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Software Rejuvenation.pdf
Type: application/pdf
Size: 327681 bytes
Desc: not available
URL: <http://www.pwg.org/pipermail/ipp/attachments/20180111/ca5a6dde/attachment.pdf>