Sunday, November 9, 2014

Splitting hot aggregates

When you visit a real casino, the constant busyness is overwhelming: players spamming buttons, pulling levers, spinning the wheel, gambling on the outcome of sports games, playing cards, feeding the machine, cashing out, breaking bills, ordering drinks or even buying a souvenir. A single player will easily generate a thousand transactions in one sitting.

When you look at an online casino, this isn't very different. In the system we inherited, the biggest and busiest aggregate by far is a user's account. Basically every action that involves money leads to activity on this aggregate.
This makes sense. An account is an important consistency boundary, if not the most important one. Casinos can't afford to have people spend more than their account holds.

Since we're applying optimistic concurrency, bursts of activity would occasionally lead to transactional failures. Looking at a real casino, it's easy to see why they aren't running into these types of issues.
In a physical casino, it's only the owner of the wallet that gets to access it. Casino employees are not allowed to take a player's wallet out of his pocket to make a transaction. There is no concurrent use of a player's wallet: single spender principle.
Online, on the other hand, we aren't constrained by common courtesy and have no problem reaching into a user's pocket. It's common to have a user playing the slots, while we automatically try to pay out a sportsbetting win once the results of a game are in.
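
A minimal sketch of that conflict in Python, assuming an event stream guarded by an expected-version check. The names (EventStream, ConcurrencyError, the event types) are made up for illustration and are not taken from the system described here.

```python
from dataclasses import dataclass, field


class ConcurrencyError(Exception):
    """Raised when the stream moved on since the writer last read it."""


@dataclass
class EventStream:
    # Hypothetical in-memory stand-in for an account's event stream.
    events: list = field(default_factory=list)

    @property
    def version(self) -> int:
        return len(self.events)

    def append(self, event: dict, expected_version: int) -> None:
        # Optimistic concurrency: only append if nobody wrote in between.
        if expected_version != self.version:
            raise ConcurrencyError(
                f"expected version {expected_version}, stream is at {self.version}"
            )
        self.events.append(event)


account = EventStream()
account.append({"type": "Deposited", "amount": 100}, expected_version=0)

# Both processes read the account at the same version...
seen_by_slots = account.version
seen_by_settlement = account.version

# ...the slot bet gets written first...
account.append({"type": "BetPlaced", "amount": 1}, expected_version=seen_by_slots)

# ...so the background sportsbetting payout fails and has to be retried.
try:
    account.append(
        {"type": "WinPaidOut", "amount": 50}, expected_version=seen_by_settlement
    )
    payout_failed = False
except ConcurrencyError:
    payout_failed = True
```

The more often the player writes to the account, the smaller the window in which a background process can read and write without being overtaken.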

Mapping out an aggregate's eventstream on a timeline is a great way to visualize its lifecycle and usage patterns. When we did this for an account, we came up with something that looked like this.

Activity peaks when a user starts a game. Each bet and each win drags in the account aggregate. When you know that some players make thirty bets per minute, it should come as no surprise that other processes accessing the account in the background might introduce transactional failures.

Inspired by a real casino, I wonder if users online would appreciate it if we stayed out of their pockets and let them do it for us instead.
Instead of paying out sportsbetting winnings automatically, we could notify a user that his bet was settled and that he can head over to his bet and cash out the winnings to his account any time.
The same goes for games; instead of cashing out wins to a player's account after each bet, we could - like in a casino - accumulate all winnings in the slot machine itself, also known as a game session, for the player to cash out by pushing a button once he's done playing. To reduce the number of small bets taken from the account, we could also encourage users to feed the slot machine before they start playing.

In practice, we would extract behaviour out of the account aggregate and move it into the sportsbet and game session aggregates. Only at the end of their lifecycles would we involve the account aggregate to move money around.
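
As a rough sketch of what that could look like, assuming amounts in whole cents and hypothetical Account and GameSession classes (the writes counter exists only to show how often the account gets touched):

```python
class Account:
    def __init__(self, balance: int) -> None:
        self.balance = balance
        self.writes = 0  # illustration only: counts writes to the account

    def withdraw(self, amount: int) -> None:
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount
        self.writes += 1

    def deposit(self, amount: int) -> None:
        self.balance += amount
        self.writes += 1


class GameSession:
    """Short-lived aggregate that holds the money while the player is playing."""

    def __init__(self, account: Account, buy_in: int) -> None:
        # Feeding the slot machine: the first of two writes to the account.
        account.withdraw(buy_in)
        self._account = account
        self.credits = buy_in
        self.closed = False

    def bet(self, stake: int) -> None:
        if self.closed:
            raise RuntimeError("session is closed")
        if stake > self.credits:
            raise ValueError("not enough credits in this session")
        self.credits -= stake

    def win(self, amount: int) -> None:
        if self.closed:
            raise RuntimeError("session is closed")
        self.credits += amount

    def cash_out(self) -> None:
        # End of the lifecycle: the second and last write to the account.
        self._account.deposit(self.credits)
        self.credits = 0
        self.closed = True


account = Account(balance=100)
session = GameSession(account, buy_in=20)
for _ in range(10):
    session.bet(1)   # none of these touch the account
session.win(5)
session.cash_out()
```

Ten bets and a win end up as two writes to the account instead of eleven. The session still enforces its own limit, so a player can't bet more than he fed into the machine.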

By spreading activity to other, shorter-lived aggregates, and having the player do a bit of our work, we could reduce the amount of concurrency on the account aggregate and end up with fewer transactional failures.

But can we really expect users to cash out manually? Probably not. We can still use most of the mechanics we just came up with, though, and cash out automatically. We can cash out winnings automatically when a user leaves a game session. We can queue up sportsbetting winnings and cash out when a user isn't playing a game.
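
Queuing up sportsbetting winnings could be sketched like this, assuming something tells us when a game session starts and ends. PayoutQueue and its method names are hypothetical, and the deposit callback stands in for whatever writes to the account.

```python
from collections import deque
from typing import Callable


class PayoutQueue:
    """Holds settled sportsbetting winnings until the player is not in a game."""

    def __init__(self) -> None:
        self._pending: deque = deque()
        self._playing: set = set()  # player ids with an open game session

    def session_started(self, player_id: str) -> None:
        self._playing.add(player_id)

    def session_ended(self, player_id: str) -> None:
        self._playing.discard(player_id)

    def bet_settled(self, player_id: str, amount: int) -> None:
        self._pending.append((player_id, amount))

    def flush(self, deposit: Callable[[str, int], None]) -> int:
        """Pay out to every player who is not playing; return the number of payouts."""
        still_pending: deque = deque()
        paid = 0
        while self._pending:
            player_id, amount = self._pending.popleft()
            if player_id in self._playing:
                still_pending.append((player_id, amount))  # try again later
            else:
                deposit(player_id, amount)
                paid += 1
        self._pending = still_pending
        return paid


balances = {"alice": 0}


def deposit(player_id: str, amount: int) -> None:
    balances[player_id] += amount


queue = PayoutQueue()
queue.session_started("alice")
queue.bet_settled("alice", 50)
paid_while_playing = queue.flush(deposit)  # alice is spinning, nothing is paid

queue.session_ended("alice")
paid_after_session = queue.flush(deposit)  # now the winnings reach her account
```

The payout is delayed rather than retried: it only reaches the account once the player has stopped writing to it.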

By exploring alternatives, we discovered that we can work the model to reduce activity and concurrency on the account aggregate, lowering the chances for transactional failures. Now, it's only fair to say that there are other, more technical, options. The most obvious one would probably be making the existing transactions on the account aggregate shorter, also lowering the chance of concurrent use of the account.

I can't help but guess the actor model might be a better fit for this type of problem.


  1. Nice topic, Jef.

    As an alternative you could make the account a "real" serialization boundary, i.e. only allowing one writer at a time, thereby mitigating the "concurrency" altogether. Of course this introduces more latency as all actions for that aggregate sit in a queue - however virtual it may be - in front of it. As you've found out, it's often useful to carefully look at the various consistency boundaries within the aggregate to see if you can tease things apart. One might have been too enthusiastic in the past, lumping behaviors together because they fit the "noun". Insights, don't you love 'em ;-) Getting back to our queue, that virtual queue could be a priority queue, because, let's face it, not all behaviors are created equal. Some are more important to do "now", while others could be done later. Giving players their money could be done "later" if traffic is just too busy (I know, I'm repeating what you've said). Or they could opt in to do so explicitly when they're running out of money. In the meantime, while this "money won" is in transit between a game and a player's wallet (Why call it an account when it really is a wallet? Am I splitting hairs yet?), could your business make money off of it? Imagine cashing out "now" incurs a penalty (money-wise) on the actual amount they get just because they get it faster. Would it really scare players away? If not, how many players would use that feature? How much is the development effort and optional extra hardware to handle such players? Obviously, I'm going out on a limb here, since ... what do I know about gambling, right? ... but it sounds like you're on the verge of a breakthrough. Historical analysis of a player's gambling behavior could also be used as input to pre-allocate money from his wallet when he's about to embark on another game of the same type. In the end you're trying to keep them happy, giving them a shot of endorphins if and when they win, getting them to spend more, etc ...

    On the other hand, should your users incur a penalty due to your scaling issues? It's hard to tell, you're offering a service, making money off of what the users do, ... a lot depends on the business model chosen.

    Just random thoughts ...

    1. Inspirational, as always Yves. I'll definitely be taking these ideas with me going forward!

      On a side note, account is something we've been dragging along so long it has become a first class citizen of the language by now.