January 9, 2010

High Availability

The availability of a system is a measure of its readiness for usage and is a component of a systems dependability

It is possible to approach 100% availability with messaging-based systems because multiple copies of a message consumer can be running and listening to the message bus at the same time which act as a pool of consumers all working to process a message. This is called parallelization and allows the deployment of as many fail-over consumers as may be required to ensure that the logic provided by the consumers is available at all times.

Through parallelization, a message consumer can be replicated throughout the data processing environment, observing messages on the same message group. This allows all the consumers to receive a copy of the message and process the data therein at the same time. The message producer never needs to know where these message consumers are running or how many of them there are. One or more of these replicated consumers may fail and the message will still be processed. This is one of the big benefits of broadcasting data through a message bus.

There are times, though, when some level of coordination will be required between the parallelized message consumers. At the very least, the amount of redundant processing is wasteful but more likely the redundant processing will result in logic errors such as multiple entries being posted to a ledger. Tome level of coordination is therefore required to reduce or eliminate the likelihood of duplicate processing. This coordination can take many forms and there are a variety of design patterns which can address this issue. Fortunately, mos message infrastructures have a feature which allows coordination between all parallelize consumers; message queuing.

Message Queuing

Nearly all message infrastructures have a quality of service (QoS) for the delivery of messages known as queuing. When message brokers utilize queues in message delivery, the messages are placed in storage (memory, disk or both) from which message consumers retrieve their messages when they are ready to process the message data. The message broker removes the message from storage once the message is successfully received by a message consumer and the message becomes unavailable for retrieval. This ensures that only one message consumer will receive the message for processing.

Message queues are a First In, First Out (FIFO) construct which ensures messages are made available to consumers in the order they were received by the broker. Some brokers allow message prioritization, message read-ahead (i.e. peeking) and some even allow for message push-back. These are qualities of service supported to varying degrees by some and not all message broker technologies so one should not design their systems to utilize them unless it is acceptable to become dependent on a particular messaging vendor. The common QoS for message queuing is FIFO delivery to only one message consumer.

When using message queuing, it is a simple matter to have multiple versions of the same message consumer running on different hosts in different sites all retrieving messages from a message group with message queue message delivery. In this configuration, a deployment can suffer a loss of one or more message consumers without a loss of service. Therefore parallelizing message consumers on a message queue is a relatively simple and efficient way to reduce failure points and to ensure high availability.

Consider an example where messages containing sales orders are sent to a message group for processing. Once processed, a message is returned to the sender of the sales order message as confirmation. The system would have a single point of failure if there were only one sales order processor listening for sales order messages. If that processor failed, then the sales order system would be unavailable. If the architect placed multiple sales order processors on the message group, all the currently operating sales order message processors would receive and process the message resulting in multiple orders in the system. If all these message processors were instead listening to the sales order message queue, then the message broker would only let one of the message consumers process the message resulting in only one sale order being entered in the system.

Load Balancing

When components are parallelized and using message queuing, each component handles processing according to its capabilities. When a component is busy processing a particularly complex operation, it is unavailable for processing the next message sent to the message group. It will be the other instances which will receive the subsequent messages for processing. This results in automatic load balancing where messages are consumed by bus participants only when they are available; the message broker will not attempt to deliver a message to a message consumer while it is busy processing a previous message or the component has experienced a logic error and has stopped processing. Only message consumers which are operational and ready to process the next message receive messages.

While not as configurable as some load balancing systems, this load balancing is achieved free as a by-product of parallelizing using message queues. Each message consumer can take as long as it needs to process the message and retrieve the next message from the queue only when it is ready. If messages begin to pile up in the message queue, then additional instances of the message consumer can be deployed to handle the load, but that will be covered in another article entitled “Horizontal Scalability”.

Fail-over / Fall-back

Another by-product of using message queues in parallelized deployments is Fail-Over processing when a component fails and the subsequent Fall-back when the component is restored. Although these concepts are commonly used to describe the operation of connection-oriented, client-server relationships, the results and qualities of these concepts are achieved through the use of “connection-less” message bus deployments.

In a client-server design, if the server fails the client must find, or “fail-over” to, another server in order to have its requests processed. The client needs special logic to detect the failure in the service to which it is currently connected, break the connection, locate another server, establish a connection and resume processing. Subsequently, the client needs logic to detect when the original server is restored so it can “fall-back” to the primary server and continue its processing. The logic in the client can be rather complex and tightly coupled to the particular server deployment in the data center. All this combines to make fail-over and fall-back operations difficult to implement, operate and maintain.

Should a parallelized component fail, the remaining components will continue to operate by pulling messages off the message bus and processing the data in those messages. In effect, fail-over is accomplished automatically because the failed component is not available to pull the message from the bus while its peers are. When the failed component comes back on-line, it immediately begins processing as work becomes available by grabbing the next message from the message queue effectively providing fall-back.

Maintenance Scenario

Even when a system needs to be upgraded, a system can remain 100% available and not suffer any downtime throughout the upgrade. Consider the following real-world business case.

An authentication system has been deployed which accepts encrypted credentials from clients through a message group which has a queued delivery scheme. Several message consumers have been deployed throughout the data center on separate hosts which utilized a clustered database for its data store. The result is that there are no single points of failure in the system with the services and the data store operating in a high availability mode.

The operations staff are tasked with upgrading the authentication service to one which uses not the data base but the corporate directory through the Lightweight Directory Access Protocol (LDAP). The database has been locked from updates and all the security principals along with their credentials have been imported into the directory.

The new message consumers providing the authentication service through the LDAP are deployed while the existing message consumers which use the data base are left on-line. The result is both version of the message consumers are running and authenticating credentials. Some of the credentials are being authenticated via the database and the rest are being authenticated via LDAP to the corporate directory.

Operations carefully watches the logs of the new components for any indications of problems and the operation of the new components are verified before the migration is completed. After no errors are observed in the logs and the help desk reports no clients have reported any issues the last step of the service update is started.

The older components which utilize the data base for credential authentication are terminated individually. As each one is terminated, the system logs are scanned for any errors. After all the original authentication services are terminated, the data base is backed-up and scheduled for purging.

The result is the authentication system on which all the other systems depend has been completely swapped out without loss of service. Authentication services were available at all times throughout the maintenance window and none of the end users or dependent system were aware of the service upgrade.

Versions of the above scenario has been performed numerous times in high volume production environments within critical business systems with no loss of availability. In all cases, the systems suffered absolutely no downtime and maintained 100% availability.


Utilizing a message bus allows one to deploy highly available systems through the use of parallelization of message consumers. By using a queued message delivery, the deployment can eliminate possible duplicate processing of messages and avoid having to code coordination logic in the message consumers which use a broadcast delivery method.


About This Blog

This is a low frequency blog on implementing SOA designs using message brokering. It is a list of practical tips developed over several years deploying service oriented systems for targeted and mass deployments in the telecommunications carrier and electric utility markets.

Recent Articles

Blog Archive

   Copyright © Steve Coté - All Rights Reserved.

Back to TOP