Idempotency, state management, and recovery

Abhinav
5 min readJan 19, 2024

A software system is a set of multiple software systems interacting with each other. These include services running logic, data stores, and systems that allow for asynchronous message relay. Most of the time, they work in sync. But often the quality of design is apparent only when a subsystem goes down.

In this blog post, instead of discussing this theoretically, we’ll take an example of a system, see at what points the system could fail, and how we can design the system to recover from failure.

Equity Trading Platform

For this example, let’s try to build a platform that trades equities, such as Zerodha. This platform allows placing of a Limit Order, Market Order, Buy/Sell, all via a 3rd party exchange. Let’s say the system we’re designing talks to another service that keeps store of the account balance and the equity balance that you have.

Here is a rough diagram of what we have —

The Equity Trading system can retrieve or update balances of equity / wallet from the Balances Service. So if a sell order goes through, the Equity Trading System can reduce the equity balance of that share and increase the cash balance for that user.

All orders are placed onto the stock exchange via a REST API. The order is filled in chunks, and every time a part of the order is filled, we receive an asynchronous message from the exchange. We’re not getting into details about how we get this message.

Points of Failure

This is a simplistic system, but it can have many points of failure. At each of these stages, their can either be a failure during the process, or a failure right after. The failure could be a network failure, or one of the systems might crash —

  1. When the Trading System is reserving an amount from the Balances Service. So the Trading System doesn’t know whether the reservation went through.
  2. Either the network or system failure while placing the order.
  3. A failure while receiving order fulfilment messages from the exchange.
  4. A failure crediting / debiting amounts back into the balances service.

Keeping Record of Each Checkpoint

With such a system, we cannot afford to let failure recovery be a manual process. And to automate failure recovery, we need to make sure of 2 things —

  1. Each step is recorded. We have to have a state that is updated after every step in the transaction takes place.
  2. Each step is idempotent. If the order is already placed on the external system, there needs to be a way for us to verify and avoid placing a duplicate order.

Idempotency

To achieve idempotency, we need to build build the system in a manner such that each action has a corresponding idempotency key. For example, when you make a call to the Balance Service, send the idempotency key, and the message body. That way, if there is a network failure, the next time you try, you can query the balance service using the idempotency key and find out if the request had gone through or not.

We use this concept in each interaction between two different microservices. This way each interaction between 2 systems is idempotent.

Record of State Change

The database needs to keep a record of each state the system goes into. For example, the creation of an order, then each time a part of order is filled, success, failure, cancellation, etc.

For example, the Order placement to Exchange will have a Processing, Success, or Failed state. Similarly, each action would have an entry into the database and an associated state. The recovery code now knows what has happened to each order, and during recovery it can pick up from that position and proceed.

Examples of Recovery

Scenario 1: The system starts placing the order to the exchange, and the system crashes. We don’t know whether the order was placed or not.

Here is how you make your system recover —

  1. If the order was successfully placed on the exchange, it will automatically send fulfilment information back to our system. In that case, we can mark the order as successfully placed, and then the order will close in a normal fashion.
  2. If there is no fulfilment information received in a certain time frame, we can write a cron to check with the exchange whether the order was placed or not, using the idempotency key. If it was not placed, we can simply refund the user’s balance, and close the order.

Scenario 2: The system does not receive fulfilment information from the exchange due to a network error, and the order is left open.

The recovery process will be —

  1. Periodically check with the exchange for an update on open orders. If the order is filled on the exchange, simply retrieve the fill information via a REST API, update the information on our system, and start closing the order.
  2. If our system receives a fulfilment message while it is manually retrieving the information from the REST API of the exchange, appropriate code needs to be written to ensure that the same information is not processed twice.

Conclusion

In both the above scenarios, the actual implementation is pretty difficult. The system needs to avoid race conditions, processing the same order twice, avoiding calculation errors, and a host of other complexities.

But the concept remains the same. Each atomic step needs to be recorded, and each step needs to be idempotent. This way, if there is a system failure, the recovery system can look at the current state of the system, and proceed towards closing the order.

Bonus: State Machine?

To a lot of you, this might seem like a good example to use a state machine. But I strongly advise against it.

A state machine firstly requires all of the different actions to be organised into states and sub-states, whereas many of the actions are different from each other. So the code should be able to judge the current state based on an aggregation of the actions in the database.

A state machine also requires one central source of truth to drive the next state from a given state and a trigger. For a system with as much variability and dependence on different domains as this one, that would mean having way too much domain logic in one single place. It is better to have the logic be distributed, but the actions be updated in a common database.

--

--