Improving Availability via Staggered Systems

Simultaneous Failures

The reliability of a redundant system is optimized by minimizing the probability that both systems fail simultaneously. If both systems have the same failure probability distribution and were started at the same time, then when one system is most likely to fail, so is the other.

Previous Methods

Previous methods for calculating estimated availability from any point in time are flawed because they are based on memoryless random variables. Under a memoryless model, the calculated average time to the next failure is always the same, regardless of how long a system has been in service.
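
The memoryless assumption can be illustrated with a small numerical sketch. It compares the expected remaining time to failure at several ages under an exponential (memoryless) model and under a wear-out model. The Weibull wear-out model, the parameter values, and the function names are assumptions chosen for illustration and are not taken from the articles.

```python
import math


def mean_residual_life(survival, age, horizon=200.0, steps=100_000):
    """Expected remaining time to failure for a unit that has survived to `age`.

    Integrates survival(u) from age to age + horizon with the trapezoidal
    rule and divides by survival(age). The horizon must be long enough that
    the survival function is negligible beyond it.
    """
    dt = horizon / steps
    total = 0.0
    prev = survival(age)
    for i in range(1, steps + 1):
        cur = survival(age + i * dt)
        total += 0.5 * (prev + cur) * dt
        prev = cur
    return total / survival(age)


def exponential_survival(t, mean=10.0):
    # Memoryless model: constant failure rate (illustrative mean).
    return math.exp(-t / mean)


def weibull_survival(t, scale=10.0, shape=3.0):
    # Assumed wear-out model: failure rate rises with age when shape > 1.
    return math.exp(-((t / scale) ** shape))


if __name__ == "__main__":
    for age in (0.0, 5.0, 10.0):
        print(age,
              mean_residual_life(exponential_survival, age),
              mean_residual_life(weibull_survival, age))
```

Under the exponential model the remaining life does not change with age, which is the behavior the paragraph above criticizes. Under the assumed wear-out model it shrinks as the system ages.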

Staggering System Start Times

Staggering the system starting times so that their failure probability distributions are not aligned ensures that the times at which the two systems are most likely to fail differ. When one system is most likely to fail, the probability that the other system will fail is significantly reduced, so the probability of a dual system failure is reduced as well. Redundant system reliability can therefore be greatly enhanced. This strategy applies to both hardware failures and software failures.
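
The effect of staggering can be sketched with a small Monte Carlo simulation. This is a simplified model under stated assumptions, not the method used in the articles. It considers only the first failure of each system, draws times to failure from a Weibull distribution, and counts a dual failure when the two failures fall within one repair window of each other. The function name and all parameter values are hypothetical.

```python
import random


def dual_failure_probability(stagger, repair_time=24.0, scale=1000.0,
                             shape=5.0, trials=50_000, seed=1):
    """Estimate the chance that both systems are down at the same time.

    System A starts at time 0 and system B starts at time `stagger`.
    Each runs for a Weibull-distributed time before its first failure.
    The failures count as a dual failure when they occur within
    `repair_time` of each other. All parameter values are illustrative.
    """
    rng = random.Random(seed)
    overlaps = 0
    for _ in range(trials):
        fail_a = rng.weibullvariate(scale, shape)
        fail_b = stagger + rng.weibullvariate(scale, shape)
        if abs(fail_a - fail_b) < repair_time:
            overlaps += 1
    return overlaps / trials


if __name__ == "__main__":
    # Compare aligned start times with two staggered configurations.
    for stagger in (0.0, 250.0, 500.0):
        print(stagger, dual_failure_probability(stagger))
```

In this model the estimated dual failure probability is highest when both systems start together and falls as the stagger grows, consistent with the argument above. A fuller model would also need to account for restarts after repair.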

Articles:
Improving Availability via Staggered Systems Part 1: MTTF — Mean Time To Failure

Improving Availability via Staggered Systems Part 2: Mitigating Redundant Failures via System Staggering

Improving Availability via Staggered Systems