Continuous Availability and Active/Active Replication Systems

What is an Active/Active System?


Figure 1 — Active/Active Architecture

As shown in Figure 1, an active/active system is a network of independent processing nodes, each having access to a common replicated database, such that all nodes can participate in a common application.

In the most general case, the nodes are completely symmetric. Any transaction can be routed within the application network to any node, and that node can read or update any set of data items in the database. This approach provides the most flexibility and makes the most of the system investment, because requests can be load-balanced across all available processing capacity. If a node fails, users at the other nodes are unaffected, and the users at the failed node can be quickly switched to surviving nodes, restoring their services in seconds or less.

An active/active network contains at least two copies of the application database. All database copies are kept in synchronism so that any copy can be used for a transaction. If a database copy fails, all transactions are routed to a surviving copy.
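The routing behavior described above can be illustrated with a minimal sketch. It assumes a simple round-robin policy and an in-memory health flag per node; the `Node` and `Router` names are hypothetical and do not correspond to any particular product or network component.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical processing node with its own database copy."""
    name: str
    healthy: bool = True
    handled: list = field(default_factory=list)

    def process(self, txn: str) -> str:
        self.handled.append(txn)
        return f"{txn} handled by {self.name}"


class Router:
    """Round-robin router that skips nodes marked unhealthy."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._next = 0  # index of the node to try first for the next request

    def route(self, txn: str) -> str:
        # Try each node at most once, starting from the round-robin cursor.
        for offset in range(len(self.nodes)):
            index = (self._next + offset) % len(self.nodes)
            node = self.nodes[index]
            if node.healthy:
                self._next = (index + 1) % len(self.nodes)
                return node.process(txn)
        raise RuntimeError("no surviving node available")


nodes = [Node("A"), Node("B"), Node("C")]
router = Router(nodes)

for i in range(3):
    router.route(f"txn{i}")      # spread across A, B, and C

nodes[1].healthy = False          # node B fails

for i in range(3, 7):
    router.route(f"txn{i}")      # only A and C receive work now
```

In this sketch, users of the surviving nodes see no change when node B fails; requests that would have gone to B are simply handed to the next healthy node.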

Provided that the nodes and database copies are geographically distributed, active/active systems provide disaster tolerance for little or no additional cost when compared with active/passive configurations. If a disaster takes out a node or a database copy, there are others in the network immediately available to take their place.

Why Does an Active/Active System Work?

The availability of a system is determined by the amount of time that it is operational and providing application services (the system uptime) as compared to the amount of time that the application services are being denied to one or more users (the system downtime).

Although certain techniques can improve the uptime of an individual system, such as increased operator training and the use of fault-tolerant components, a single system can never provide the levels of uptime necessary for critical business services. Even if the system hardware, software, and operations were 100% reliable (which is impossible), a single local event, such as a fire, power outage, or flood, would still cause an outage. Active/active technology provides the necessary redundancy to reduce downtime by orders of magnitude.
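The "orders of magnitude" claim can be made concrete with a small calculation. The sketch below assumes that node failures are independent and that the application is only completely unavailable when every node is down at once; both are simplifying assumptions, and the 99.9% per-node availability is an illustrative figure, not a measured one. The function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of total time during which services are being provided."""
    return uptime_hours / (uptime_hours + downtime_hours)


def probability_all_down(node_availability: float, node_count: int) -> float:
    """Chance that every node is down at once, assuming independent failures."""
    return (1.0 - node_availability) ** node_count


def expected_total_outage_hours(node_availability: float, node_count: int) -> float:
    """Expected hours per year with no surviving node at all."""
    return probability_all_down(node_availability, node_count) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Illustrative node: 8.76 hours of downtime per year, i.e. 99.9% available.
    single = availability(uptime_hours=8751.24, downtime_hours=8.76)
    for n in (1, 2, 3):
        print(n, "node(s):", expected_total_outage_hours(single, n), "hours/year")
```

Under these assumptions, a single node at 99.9% availability is down about 8.76 hours per year, while two such nodes are both down for only about 32 seconds per year. The calculation ignores the brief interval needed to switch users away from a failed node, which the text addresses separately.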


Figure 2 — How an Active/Active System Works

As shown in Figure 2, if a node fails, users at that node can be switched to another operable node immediately. If a database fails, there is another consistent copy in the network that can be used. If a network component fails, alternate routes are provided. Using technology available today, failure recovery can be achieved in seconds or less. In short, let it fail (because it surely will), but fix it fast.

Regardless of the type of failure, far fewer users are affected when a node or database fails than with other disaster-tolerant architectures. For example, in an active/backup (classic disaster recovery) architecture, any failure and switchover affects all users, and therefore usually requires the approval of upper-level management, which may be hard to obtain quickly. In an active/active system, these types of failures affect only the users on that node or database, not the entire user population. Since other known-working nodes exist in the network, these users can be quickly switched to an alternate node.

Active/active systems eliminate the uncertainty that always exists when an active/backup approach is in place. Uncertainty as to whether the failover will succeed often results in indecision, which further extends the outage. In an active/active system, a failure requires no leap of faith surrounding the failover to a backup system; all nodes in an active/active network are known to be working because they are performing real work at all times. One only needs to re-route the users that were attached to the failed node to a surviving node, and this operation can often be masked from the users by network switching/routing software. Because of this extremely fast recovery, active/active systems provide continuous availability (recovery in seconds or less), as opposed to active/passive systems, which provide only High Availability (HA) (recovery in minutes or hours).

Active/active architectures allow for all purchased nodal processing capacity to be actively working on satisfying user requests of any type (e.g., read and/or update). There is no backup (passive standby) system sitting idly by waiting for another component to fail. All nodes are actively performing real work, all of the time.

Active/Active – Shadowbase Database Synchronization

A key requirement for implementing an active/active system is the synchronization of the databases. Each database copy must always be in a consistent state and must reflect the current state of the application. The Shadowbase solution accomplishes this task by automatically replicating changes made to each database copy to all other copies in the application network. The Shadowbase technology contains a powerful database replication engine that provides bi- or multi-directional replication between the database copies and guarantees that all copies remain in a consistent and correct state.

A concern that must be addressed in active/active database synchronization is that of data collisions. A data collision occurs when two nodes change the same row in their respective database copies at substantially the same time. Each node replicates its change to the other database copy, overwriting the change made there. As a result, the database copies are different and both are wrong. The Shadowbase software can detect collisions and automatically resolve them in many cases. For those cases where Shadowbase replication cannot automatically resolve a collision, it supports embedding customer business logic into the replication engine to take whatever action is necessary to resolve the collision. There are also techniques to avoid data collisions in the first place, for example by partitioning the application or the data.
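To make the collision scenario concrete, the following sketch shows one common way a collision can be detected and resolved. It is not the Shadowbase implementation or API. It assumes that each replicated change carries the row value the source saw before its update (a "before image"), that a mismatch with the target's current value signals a collision, and that a pluggable resolver decides the winner. All names (`Change`, `DatabaseCopy`, `latest_wins`) are hypothetical, and the "latest timestamp wins" rule is only one possible policy that presumes comparable clocks.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Change:
    key: str
    before: Optional[str]  # row value the source node saw before its update
    after: str             # row value the source node wrote
    timestamp: float       # commit time at the source (clocks assumed comparable)
    origin: str            # name of the node that made the change


def latest_wins(local: Change, incoming: Change) -> Change:
    """Hypothetical rule: keep the later change; break ties by node name."""
    return max(local, incoming, key=lambda c: (c.timestamp, c.origin))


class DatabaseCopy:
    """Hypothetical database copy holding rows as key/value pairs."""

    def __init__(self, name: str, resolver=latest_wins):
        self.name = name
        self.rows: Dict[str, str] = {}
        self.last_change: Dict[str, Change] = {}
        self.resolver = resolver  # stand-in for custom business logic
        self.collisions = 0

    def _store(self, change: Change) -> None:
        self.rows[change.key] = change.after
        self.last_change[change.key] = change

    def local_update(self, key: str, value: str, timestamp: float) -> Change:
        change = Change(key, self.rows.get(key), value, timestamp, self.name)
        self._store(change)
        return change  # this change would be replicated to the other copies

    def apply_replicated(self, incoming: Change) -> None:
        current = self.rows.get(incoming.key)
        local = self.last_change.get(incoming.key)
        if current != incoming.before and local is not None:
            # The row changed here after the source read it: a data collision.
            self.collisions += 1
            incoming = self.resolver(local, incoming)
        self._store(incoming)


a, b = DatabaseCopy("A"), DatabaseCopy("B")
b.apply_replicated(a.local_update("acct", "100", timestamp=0.0))  # copies agree

# Both nodes update the same row before either sees the other's change.
change_a = a.local_update("acct", "150", timestamp=1.0)
change_b = b.local_update("acct", "90", timestamp=1.1)

b.apply_replicated(change_a)
a.apply_replicated(change_b)
```

Because both copies apply the same deterministic rule, they converge on the same value after the exchange. Replacing the `resolver` argument with a different function is where application-specific logic would go in this sketch.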
