Improving Availability via Staggered Systems

The reliability of a redundant system is improved by minimizing the probability that both systems fail at the same time. If both systems have the same failure probability distribution, then when one system is most likely to fail, so is the other. Previous methods for estimating availability from any point in time are flawed because they are based on memoryless random variables: under that assumption, the calculated average time to the next failure is always the same, regardless of how long a system has been in service. Staggering the starting times of the two systems misaligns their failure probability distributions, so the times at which each system is most likely to fail differ. When one system is most likely to fail, the probability that the other will fail is significantly reduced, and so is the probability of a dual system failure. Redundant system reliability can therefore be greatly enhanced by staggering the starting times of the two systems. This strategy applies to both hardware failures and software failures.
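
The effect of staggering can be illustrated with a small numerical sketch. It is not taken from the articles below; it assumes, purely for illustration, that each system's time to failure follows a Weibull distribution with a wear-out shape (the shape, scale, and stagger values are hypothetical), and it ignores repair and restart. The overlap integral of the two failure densities serves as a rough proxy for how likely the two systems are to fail at about the same time.

```python
import math

def weibull_pdf(t, shape, scale):
    """Weibull failure-time density; zero before the system starts (t <= 0)."""
    if t <= 0:
        return 0.0
    z = t / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))

def overlap(stagger, shape=4.0, scale=1000.0, horizon=4000.0, dt=1.0):
    """Riemann-sum approximation of the integral of f(t) * f(t - stagger) dt.

    f(t) is the failure density of the first system and f(t - stagger) is the
    density of an identical system started `stagger` hours later.
    All parameter values are illustrative assumptions.
    """
    steps = int(horizon / dt)
    total = sum(
        weibull_pdf(i * dt, shape, scale) * weibull_pdf(i * dt - stagger, shape, scale)
        for i in range(steps)
    )
    return total * dt

aligned = overlap(0.0)      # both systems started together
staggered = overlap(500.0)  # second system started 500 hours later

print(f"aligned overlap:   {aligned:.6e}")
print(f"staggered overlap: {staggered:.6e}")
```

Under these assumptions the staggered overlap is smaller than the aligned one, which matches the argument above: shifting one distribution moves its peak away from the other's.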

Articles:
Improving Availability via Staggered Systems Part 1: MTTF — Mean Time To Failure
Improving Availability via Staggered Systems Part 2: Mitigating Redundant Failures via System Staggering