Sunday, April 24, 2016

Using a batch layer for fast(er) aggregations

In the oldest system I'm maintaining right now, we have an account aggregate that, next to mutating various balances, produces immutable financial transactions. These transactions are persisted together with the aggregate itself to a relational database. The transactions can be queried by the owner of the account in an immediate consistent fashion.

The table with these transactions looks similar to this:

There's an index on the timestamp, the account identifier and the transaction type, which allows for fast enough reads for the most common access patterns which only return a small subset.

In a use case we recently worked on, we wanted real-time statistics of an account's transactions over its life time.

Running this query would seek the account id index, to look up all rows that match given predicate. In case one account has tens of thousands of transactions, this results in a high amount of reads. In case your database fits into memory, SQL Server can probably satisfy your query looking in its buffer cache. Although this still has overhead, it's supposed to be a lot faster than when SQL Server is forced to do physical reads - reading pages straight from disk. In this case, where transactions are often years old, and the database does not fit into memory, odds are high that SQL Server will be reading from disk - which is dog-slow.

One option would be to create a new covering index (including columns like CashAmount etc) for this specific workload. The problem is that indexes don't come for free. You pay for them on every write, and depending on your performance goals, that might be a cost you want to avoid. It might even be impossible, or too expensive to create such an index on environments that have no maintenance window and no license that allows for online index builds. Assuming that when you don't own said license, you don't have read replicas available either.

Considering the workload, the never-changing nature of financial transactions and constraints in place, we applied Lambda Architecture theory on a small scale, starting by building daily aggregations of transactions per account.

This translates into scheduling a job which catches up all days, by performing a query per day and appending the results to a specific table.

On our dataset, this compresses the transaction table by a factor of more than 300. Not just that, by separating reads from writes, we give ourselves so much more breathing room and options, which makes me sleep so much better at night.

As you probably noticed, for real-time statistics on this data, we're still missing today's transactions in this table. Since today's transactions are a much smaller subset and likely to live in SQL Server's cache, we can query both the batch table and the transaction table, to eventually merge the results of both queries. For our use case, resource usage and query response times have dropped significantly, especially for the largest accounts.

I don't see it happening in the near future, but in case the usage of these queries grows, we can still borrow more Lambda Architecture practices and push further.

4 comments:

  1. In situations where you only need a subset of columns but don't know which exactly before hand columnstore indexes can give you a dramatic speed increase as well.

    ReplyDelete
    Replies
    1. Yes, awesome feature, _if_ your version/license supports it :-/

      Delete
  2. In other words, you've written a projection in sql?

    ReplyDelete
    Replies
    1. Sure, if you categorize each function where you transform a temporal immutable dataset to new state as a projection. I've seen different semantics being used across different communities.

      Delete