Polling reliably at scale using DLQs

Computer programs execute blazingly fast, but sometimes we deliberately need to delay execution. Some such use cases are:

Scheduling a task for some time in the future;
Retrying a failed task with a backoff strategy

At smallcase, we place orders to buy or sell equities on our clients’ behalf. When an order is placed, its execution status is not immediately known, so we need to poll the partner broker to learn the status of the order on the exchange.

Intraday traders will tell you that stock markets are time-sensitive. Any platform that caters to them needs to be fast and deterministic. Polling for order status needs to happen promptly and at regular intervals, because an endless loading screen is never a pleasant sight.

This post explains the reliability issues we faced with our legacy system for scheduling polls and how we fixed them.

Delay mechanisms:

We will discuss the following delay mechanisms we used in our platform:

Redis Keyspace Notifications
In-memory timers
Dead letter queues

Redis Keyspace Notifications

This approach is very simple. We use Redis heavily in our stack, so it was easy for us to use it for scheduling triggers. Redis allows keys to be set with an expiry using its SETEX command. For instance, setting a key TESTKEY with the value TESTVALUE and an expiry of 10 seconds can be achieved with:

redis> SETEX TESTKEY 10 "TESTVALUE"

In parallel, Redis provides keyspace notifications for events that change data. Clients can subscribe to these notifications and use them to trigger the delayed tasks. Key expiry events can be subscribed to using:

SUBSCRIBE __keyevent@0__:expired
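An expired-key notification carries only the name of the key, not its value, so the task to run has to be recoverable from the key itself. The sketch below assumes a hypothetical `poll:<orderId>` naming scheme; the prefix and both helper names are assumptions made for illustration. Note also that Redis ships with keyspace notifications disabled: the `notify-keyspace-events` setting must include `E` and `x` (for example `Ex`) for the subscription above to receive anything.

```javascript
// Hypothetical key scheme: the key name encodes which order to poll.
const POLL_KEY_PREFIX = 'poll:';

// Key to pass to SETEX when scheduling a poll for an order.
function pollKeyFor(orderId) {
  return `${POLL_KEY_PREFIX}${orderId}`;
}

// Called with the message of an `__keyevent@0__:expired` notification,
// which is the name of the key that expired. Returns the order id,
// or null when the key belongs to something other than polling.
function orderIdFromExpiredKey(expiredKey) {
  if (!expiredKey.startsWith(POLL_KEY_PREFIX)) return null;
  return expiredKey.slice(POLL_KEY_PREFIX.length);
}
```

The null return matters because the subscription receives expiry events for every key in the database, not only the ones created for polling.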

Pros:

Stateless application code: The application doesn’t need to keep the pending tasks and their delays in its memory. That is outsourced to Redis.
Arbitrary delays: Delays of arbitrary length are possible without any extra configuration. Changing the TTL parameter of the SETEX command is enough.

Cons:

Unscalable: This strategy works only as long as a single instance of the application listens to these events. If the application is scaled horizontally, every instance receives every event because of the publish-subscribe nature of the notifications.
Unreliable: Keyspace notifications can become highly unreliable as the number of keys increases. The Redis docs say so clearly: “If no command targets the key constantly, and there are many keys with a TTL associated, there can be a significant delay between the time the key time to live drops to zero, and the time the expired event is generated.”
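The fan-out behind the scalability problem can be reproduced without Redis. The sketch below is an in-process stand-in for publish-subscribe (all names in it are hypothetical): with three subscribed instances, a single expiry triggers three polls for the same order.

```javascript
// In-process stand-in for pub/sub: every subscriber gets every message.
const subscribers = [];
const pollsTriggered = [];

function subscribeInstance(instanceName) {
  subscribers.push((key) => pollsTriggered.push(`${instanceName} polls ${key}`));
}

function publishExpired(key) {
  subscribers.forEach((handler) => handler(key));
}

['app-1', 'app-2', 'app-3'].forEach(subscribeInstance);
publishExpired('poll:ORD123');

console.log(pollsTriggered.length); // 3: one expiry, three duplicate polls
```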

This worked fine for us in the beginning. But once order volumes increased, a lot of clients started reporting really long waits before they could see the status of their orders. On debugging, we saw delays to the tune of 40x, and this was the deal-breaker for us.

In-memory Timers

As a quick fix, we switched to in-memory timers. Timers are defined and executed by the application code itself and can be used to manage delays. Node.js provides a variety of built-in timer functions. Setting a 10-second timer is as simple as:

setTimeout(() => {
    console.log('This executes after 10 seconds');
}, 10 * 1000);
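Applied to order polling, a timer can re-arm itself until the order reaches a final state. This is a minimal sketch under stated assumptions, not production code: `fetchStatus` is a synchronous stand-in for the call to the broker, the `'PENDING'` status value is hypothetical, and the `schedule` parameter exists only so that the timer can be swapped out in tests.

```javascript
// Polls at a fixed interval until the status is no longer 'PENDING'.
// `schedule` defaults to setTimeout and has the same (callback, ms) shape.
function pollUntilDone(fetchStatus, onDone, intervalMs, schedule = setTimeout) {
  schedule(() => {
    const status = fetchStatus();
    if (status === 'PENDING') {
      // Not final yet: arm another timer for the next attempt.
      pollUntilDone(fetchStatus, onDone, intervalMs, schedule);
    } else {
      onDone(status);
    }
  }, intervalMs);
}
```

Calling `pollUntilDone(fetchStatus, console.log, 10 * 1000)` would check every 10 seconds and print the final status once. The pending timer lives only in the memory of the process that created it.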

Pros:

Arbitrary delays: The delay is controlled entirely by the time argument. Passing the required delay in milliseconds is all that is needed.
Reliability: Timers in code are quite reliable. Deviations of only up to a few milliseconds were observed.

Cons:

Stateful application code: The application adds to its state whenever a timer is set up. It also increases the memory footprint of the application.
Unscalable: In-memory timers do not work well with horizontal scaling. A timer always fires on the instance that set it, irrespective of how traffic is distributed across the instances.

Timers worked well to mask the problem until we had the bandwidth to think it through and fix it for good. We now needed a permanent, reliable solution that would scale.

Dead Letter Queues

After researching the topic a bit, we decided to try dead letter queues (DLQs). A DLQ is a sophisticated but robust way to delay message delivery, available in message queues like Amazon SQS, RabbitMQ and ActiveMQ.

In very basic terms, a message with an expiry is put into a queue. The queue is instructed to take some action when messages expire without being read. If the expired messages are pushed to a different queue that is actively consumed by the target application, the result simulates delayed delivery of the message. The following diagram explains this concept:

We chose RabbitMQ, which is based on AMQP 0-9-1, for our implementation because of the maturity of the protocol and its community.
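In RabbitMQ, this setup comes down to a few optional arguments on the waiting queue. The sketch below builds them as a plain object: `x-message-ttl`, `x-dead-letter-exchange` and `x-dead-letter-routing-key` are RabbitMQ's argument names, while the function name and the exchange and routing key in the usage line are hypothetical. How the object is passed when declaring the queue depends on the client library.

```javascript
// Builds the optional arguments for a "waiting" queue that nobody consumes.
// Messages sit there until the TTL elapses; RabbitMQ then dead-letters them
// to the given exchange, from where they reach the queue the workers read.
function waitingQueueArguments(delayMs, deadLetterExchange, deadLetterRoutingKey) {
  if (!Number.isInteger(delayMs) || delayMs < 0) {
    throw new RangeError('delayMs must be a non-negative integer');
  }
  return {
    'x-message-ttl': delayMs, // queue-level TTL, in milliseconds
    'x-dead-letter-exchange': deadLetterExchange,
    'x-dead-letter-routing-key': deadLetterRoutingKey,
  };
}

// Hypothetical names: a 10-second delay before a message reaches the workers.
const pollDelayArgs = waitingQueueArguments(10 * 1000, 'orders.dlx', 'orders.poll');
console.log(pollDelayArgs['x-message-ttl']); // 10000
```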

Pros:

Stateless application code: The application doesn’t need to store any state. The messages in the queue represent the state in this case.
Scalable: Unlike Redis events, message delivery in MQs can be configured to use the worker pattern, which load-balances messages across the consumers (the subscribing instances of the application) so that each message is delivered to only one of them.
Reliability: Message queues are expected to be highly reliable in the delivery of messages. In our testing with RabbitMQ, the results were close to those of in-memory timers.
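The key difference from publish-subscribe is that each message goes to exactly one consumer. RabbitMQ dispatches messages to the consumers of a queue in round-robin fashion by default; the in-process sketch below imitates that behaviour with hypothetical names and no broker involved.

```javascript
// In-process imitation of competing consumers on a single queue.
function createRoundRobinQueue(consumerNames) {
  let next = 0;
  const deliveries = [];
  return {
    deliveries,
    publish(message) {
      // Exactly one consumer receives each message.
      const consumer = consumerNames[next];
      next = (next + 1) % consumerNames.length;
      deliveries.push({ consumer, message });
    },
  };
}

const pollQueue = createRoundRobinQueue(['app-1', 'app-2', 'app-3']);
['ORD1', 'ORD2', 'ORD3', 'ORD4'].forEach((id) => pollQueue.publish(id));

console.log(pollQueue.deliveries.map((d) => `${d.message}->${d.consumer}`).join(' '));
// ORD1->app-1 ORD2->app-2 ORD3->app-3 ORD4->app-1
```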

Cons:

Fixed delays: In RabbitMQ’s implementation of DLQs, the TTL is bound to the queue and applies to all incoming messages. There is a way to set a message-level TTL, but it has its own caveats. The most serious side effect is that a message with a higher TTL blocks the queue and delays the processing of messages with lower TTLs behind it, which is a bigger problem. Hence, we use DLQs with a queue-level TTL.
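The blocking effect of message-level TTL is easy to reproduce in a simulation. The sketch below assumes, as the caveat describes, that a queue only expires messages from its head; the function and field names are hypothetical and no broker is involved.

```javascript
// Simulates a FIFO queue in which only the head message can expire.
// Each message has `enqueuedAt` and `ttl` in milliseconds, listed in
// arrival order; the result says when each message leaves the queue.
function deadLetterTimes(messages) {
  let previousExit = 0;
  return messages.map(({ id, enqueuedAt, ttl }) => {
    const expiresAt = enqueuedAt + ttl;
    // A message cannot leave before the one in front of it has left.
    const leavesAt = Math.max(expiresAt, previousExit);
    previousExit = leavesAt;
    return { id, expiresAt, leavesAt };
  });
}

const exits = deadLetterTimes([
  { id: 'slow', enqueuedAt: 0, ttl: 60000 },
  { id: 'fast', enqueuedAt: 100, ttl: 1000 },
]);
console.log(exits[1]); // { id: 'fast', expiresAt: 1100, leavesAt: 60000 }
```

With a queue-level TTL every message waits the same amount of time, so messages expire in the order they arrived and this situation cannot arise.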

The chart shown below analyzes the average delay observed in resolving the orders we handled in the past year and its relationship with the delay mechanism being used.

It is clear that the polling delay with Redis was large but consistent to start with, and grew out of proportion quickly as volumes increased. This was quickly fixed using timers but, like all quick fixes, that too had a short life. Finally, we revamped our systems and used DLQs. It is evident from the graph that DLQs brought consistency to the system.

We traded flexibility for robustness and reliability by using DLQs to manage the delays in our polling for order status. If you’ve solved a related problem in any other way, do let us know in the comments.

