Sunday, August 12, 2018

GES scavenging and the hidden cost of link events

Somewhere around a year ago, we started using GES in production as the primary data store of our new loyalty system. The system stores two types of data.
  1. External services push batches of dumbed-down events to the loyalty system. For example: a user logged on, played a game or participated in a competition. These events are transient by nature. Once the loyalty system has processed them, they only need to be kept around for a few days.
  2. When these ingress events are processed, they go through a mini pipeline in which each event is assigned a specific weight, then aggregated and translated into a command. This command awards a user virtual currency to spend in the loyalty shop and a number of points contributing to a higher rank, which unlocks more privileges. The state machine that stores a user's balance and points is backed by a stream which is stored indefinitely, unless the user asks to be forgotten.
As a rough estimate, for every 1000 ingress transient events, only 1 needs to be stored indefinitely as part of a state machine.

When implementing this more than a year ago, I thought I had done my homework and knew how to make sure the ingress events would get cleaned up. First you make sure the $maxAge metadata is set on the streams you want to clean up, then you schedule the scavenging process (other databases use the term vacuum). This worked without any surprises. Once scavenging had been run, I could see disk space being released. However, after a few months I started to become a bit suspicious of my understanding of the scavenging process. GES was releasing less disk space than I expected.
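As a rough illustration of the first step, here is a minimal sketch that builds the metadata body for a stream that should only keep events for three days. $maxAge is expressed in seconds. The event type "$user-updated", the idea of posting this body to the stream's metadata endpoint over HTTP, and the helper name are assumptions written from memory, so check them against the documentation of your GES version before relying on them.

```python
import json
import uuid


def build_max_age_metadata(max_age_seconds):
    """Build the body of a stream-metadata write that sets $maxAge (in seconds)."""
    if max_age_seconds <= 0:
        raise ValueError("max_age_seconds must be positive")
    return [
        {
            "eventId": str(uuid.uuid4()),
            # Assumption: event type used for metadata writes over the HTTP API.
            "eventType": "$user-updated",
            "data": {"$maxAge": max_age_seconds},
        }
    ]


THREE_DAYS = 3 * 24 * 60 * 60  # 259200 seconds

# Assumption: this body would be POSTed to /streams/<stream>/metadata.
body = build_max_age_metadata(THREE_DAYS)
print(json.dumps(body, indent=2))
```

Setting the metadata only marks events as eligible for removal; disk space is not reclaimed until a scavenge has run.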

Even though we had been quite generous with disk space when provisioning the nodes, we would run out very soon. Much to my frustration, the "Storage is cheap" mantra gets thrown around too lightly. While the statement in essence is not wrong, for a database like GES that's built on top of a log, more data also means slower node restarts (index verification), slower scavenges and slower $all subscriptions.

GES has no built-in system catalog that allows you to discover which streams are taking up all this space. However, you can implement an $all subscription and count events per stream or even count the bytes in the event payload.
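A minimal sketch of that counting step, independent of any client library. The RecordedEvent type and the sample stream names are hypothetical; in practice the events would be fed in from whatever $all subscription your client exposes.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class RecordedEvent:
    """Hypothetical stand-in for an event received from an $all subscription."""
    stream_id: str
    data: bytes


def usage_per_stream(events: Iterable[RecordedEvent]) -> Tuple[Counter, Counter]:
    """Return (event count per stream, payload bytes per stream)."""
    counts, sizes = Counter(), Counter()
    for event in events:
        counts[event.stream_id] += 1
        sizes[event.stream_id] += len(event.data)
    return counts, sizes


# Made-up sample data, only to show the shape of the result.
sample = [
    RecordedEvent("$ce-ingress", b"0@ingress-1"),
    RecordedEvent("ingress-1", b'{"type":"logged-on"}'),
    RecordedEvent("$ce-ingress", b"1@ingress-1"),
    RecordedEvent("user-42", b'{"points":10}'),
]

counts, sizes = usage_per_stream(sample)
for stream, count in counts.most_common():
    print(stream, count, sizes[stream])
```

Note that this only counts payload bytes. The event envelope adds overhead per event on disk, so the real footprint is higher.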

Inspecting the results, we found that the streams emitted by the built-in system projections contained a disproportionate number of events. Most of them were ingress events that I assumed had long been scavenged! As it turns out, this was a wrong assumption on my part. Built-in projections build new streams by linking to the original event (instead of emitting a new one). But when an event is scavenged, events linking to the original event still linger around on disk. A link event is usually much smaller than the original event, since it's just a pointer, but the bytes used to store the pointer and the event envelope still take up quite some space when there are billions of them!
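To make the effect concrete, here is a toy model of a log, not GES code. It assumes the projection streams have no $maxAge of their own, so only the original ingress events expire. The record sizes are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    stream: str
    size: int                      # bytes on disk, envelope included (made-up)
    expired: bool = False          # True once $maxAge has passed for its stream
    link_to: Optional[str] = None  # set for link events only


def scavenge(log: List[Record]) -> List[Record]:
    """Keep every record that has not expired; links are never followed."""
    return [record for record in log if not record.expired]


def total_bytes(log: List[Record]) -> int:
    return sum(record.size for record in log)


log: List[Record] = []
for i in range(1000):
    # The original ingress event, past its $maxAge.
    log.append(Record("ingress-1", size=500, expired=True))
    # The link event written by a projection, pointing at the original.
    log.append(Record("$ce-ingress", size=100, link_to=f"{i}@ingress-1"))

before = total_bytes(log)
after_log = scavenge(log)
after = total_bytes(after_log)
dangling = sum(1 for record in after_log if record.link_to is not None)
print(before, after, dangling)
```

In this model the scavenge removes all the original events, yet every link event survives and now points at nothing.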

Luckily, I was only using a small portion of the built-in projections. I created a custom projection that only created streams I was actually interested in, pointed my code in the right direction, stopped the built-in projections and deleted the now irrelevant streams.

Running the scavenging process after being able to delete all these streams was very satisfying. The scavenging process loops through all the transaction file chunks one by one. It reads a chunk and writes a temporary new one, containing only the events that haven't been deleted. Once it reaches the end of the chunk, it swaps the old file for the newly written one. Since writes are slower than reads, scavenging is actually way faster when there's more to scavenge, because less data needs to be written to the new file. After all the chunks have been scavenged, the process merges the now smaller files into new chunks when possible. This process is quite transparent by design; all you have to do to follow along is list the files in the data directory while scavenging.
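The rewrite-then-merge behaviour described above can be sketched with lists standing in for chunk files. This is a simplified model under my own assumptions (a fixed chunk capacity counted in records, and merging of adjacent chunks only), not the actual GES implementation.

```python
from typing import Callable, List, Tuple

Chunk = List[str]


def scavenge_chunk(chunk: Chunk, is_deleted: Callable[[str], bool]) -> Chunk:
    """Read one chunk and write a new one holding only the surviving records."""
    return [record for record in chunk if not is_deleted(record)]


def merge_chunks(chunks: List[Chunk], capacity: int) -> List[Chunk]:
    """Merge adjacent chunks while the combined chunk still fits the capacity."""
    merged: List[Chunk] = []
    for chunk in chunks:
        if merged and len(merged[-1]) + len(chunk) <= capacity:
            merged[-1] = merged[-1] + chunk
        else:
            merged.append(list(chunk))
    return merged


def scavenge(
    chunks: List[Chunk], is_deleted: Callable[[str], bool], capacity: int
) -> Tuple[List[Chunk], int]:
    """Return the chunks after scavenging and the number of records rewritten."""
    rewritten = [scavenge_chunk(chunk, is_deleted) for chunk in chunks]
    records_written = sum(len(chunk) for chunk in rewritten)
    return merge_chunks(rewritten, capacity), records_written


# Records starting with "x" stand for deleted or expired events.
chunks = [
    ["a1", "x1", "x2", "a2"],
    ["x3", "x4", "x5", "a3"],
    ["x6", "a4", "x7", "x8"],
]
chunks_after, records_written = scavenge(
    chunks, lambda record: record.startswith("x"), capacity=4
)
print(chunks_after, records_written)
```

All twelve records are read, but only the four survivors are written back, which is why a scavenge with a lot to remove does less of the expensive work.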

When this whole process was complete, used disk space went down from 410GB to 47GB! Having trimmed all this excess data, scavenging is faster (hours, not days), node restarts are faster and resetting an $all subscription makes me less anxious.

Tuesday, May 1, 2018

Amazon Redshift - Fundamentals

Late 2017, we set out to replace and upgrade our existing reporting and analytics infrastructure with something that would be a better fit for our workloads. Keeping costs and required maintenance at a minimum would be a nice plus, making for an easy sell. After a bit of research, it was obvious Amazon Redshift had the potential to tick all the right boxes. While steadily porting the most problematic workloads away from our existing infrastructure, I started writing an investigative article on the fundamental concepts of Amazon Redshift. I learned a lot studying each individual building block, allowing me to make some small, but impactful changes to our own setup along the way.

The outcome is a 10,000-word document (1 hour reading time), covering 7 topics:
  • Storage
  • Distribution
  • Importing data
  • Table maintenance
  • Exporting data
  • Query processing
  • Workload management

The text is available in three formats:

The project is open source and available on GitHub.

Thanks to everyone who proof-read earlier iterations and provided me with indispensable feedback.

I hope this work can teach you as much as it taught me. I'm looking forward to your feedback.

Wednesday, January 3, 2018

Consumed in 2017

Another year, another 17 books, 6 shows and 3 movies consumed. Here are this year's highlights.


1. Zen and the Art of Motorcycle Maintenance

The author is a tormented soul on a quest to define quality. You're his passenger, riding shotgun on a CB77 Super Hawk, in for an exhausting intellectual journey through the high mountains of reasoning. You will often fear getting lost and feel slightly anxious that the driver might drive off a cliff any moment, but he won't. Once you see the top of the mountain for the first time, you'll be happy he doesn't make it too easy on you, and you'll be more appreciative of the road that took you there.
Throughout the process of fixing the machine things always come up, low-quality things, from a dusted knuckle to an accidentally ruined “irreplaceable” assembly. These drain off gumption, destroy enthusiasm and leave you so discouraged you want to forget the whole business. I call these things “gumption traps.” 
Peace of mind isn’t at all superficial to technical work. It’s the whole thing. That which produces it is good work and that which destroys it is bad work. 
It’s the style that gets you; technological ugliness syruped over with romantic phoniness in an effort to produce beauty and profit by people who, though stylish, don’t know where to start because no one has ever told them there’s such a thing as Quality in this world and it’s real, not style.

2. The Soul of a New Machine

The type of writing I wish there was more of. It's the closest I'll ever get to experiencing what it's like to build a minicomputer. It makes one appreciate how much we're standing on the shoulders of giants. So much has changed in 40 years and even more hasn't changed at all. People will be people.
Adopting a remote, managerial point of view, you could say that the Eagle project was a case where a local system of management worked as it should: competition for resources creating within a team inside a company an entrepreneurial spirit, which was channeled in the right direction by constraints sent down from the top. But it seems more accurate to say that a group of engineers got excited about building a computer. 
In the sixties there was proposed a “National Data Bank,” which would, theoretically, improve the government’s efficiency by allowing agencies to share information. The fact that such a system could be abused did not mean it would be, proponents said; it could be constructed in such a way as to guarantee benign use. Nonsense, said opponents, who managed to block the proposal; no matter what the intent or the safeguards, the existence of such a system would inevitably lead toward the creation of a police state.

3. Winter is Coming: Why Vladimir Putin and the Enemies of the Free World Must Be Stopped

I've been trying to get better educated on the history of world politics and long-lasting international conflicts. Growing up in Western Europe with very little direct conflict, you're never really taught why other parts of the world seem to be so messed up and why they're so angry at us.

Although you might find Kasparov to be a bit too convincing, he has good reasons to hold strong opinions and is in a unique position to shed light on what's been happening under the Putin regime. It's like listening to someone who just escaped an abusive relationship. I was reminded of Hintjens' The Psychopath Code more than once.
Like a weed, evil can be cut back but never entirely uprooted. It waits for its chance to spread through the cracks in our vigilance. It can take root in the fertile soil of our complacency, or even the rocky rubble of the fallen Berlin Wall. 
If the road to hell is paved with good intentions, compromises on principles are the streetlights. 
He and his junta have turned the country into a petro-state, and exporting natural resources to an insatiable global market doesn’t require entrepreneurs or programmers, let alone writers and professors. 
Putin and his defenders abroad bragged about Russia’s rising GDP, but it was like taking the average temperature of all the patients in a hospital. 
The hypocrisy of condemning weak dictatorships while embracing strong ones destroys American and European credibility and undermines any attempt at global leadership; in fact, it seems to encourage smaller autocracies to aspire to greater ambitions.


1. Westworld

In a not so distant future, the rich will be spending their free time visiting amusement parks inhabited by lifelike robots. Once you pay your entry ticket, you can be the protagonist of any story you want. Maybe you want good value for money and go on an epic quest chasing a bad guy to the edge of the park, or maybe you just want to drink, gamble and maybe kill a few randoms for fun in the saloon just feet away from the drop-off point? Like playing Red Dead Redemption post-virtual reality.

The setting, cast and especially the story line are out of this world. Reading up on fan theories after the show is half the fun. It's amazing how many plausible theories were put out and how a small community is able to dissect every little scene looking for hints to figure out the park's mysterious past.

This show made me question if I understood what it is that makes us human. Aren't we all just the result of our pre-programmed genetics and the events we experience throughout our lives?


2. Stranger Things 2

I just finished watching this one last night. It's like reading a really good Stephen King novel, but in color. I noticed yesterday's season finale was 61 minutes long; good things do happen when you're not constrained to TV time.



Commutes have gotten earlier and longer this year. I've cut back on the technology podcasts in favor of a broader range of topics.

1. The Joe Rogan Experience

It's a bit like sitting in on a conversation between two people at a bar. One person is a talkative and enjoyable guy, not afraid to ask questions and the other is happy to drop knowledge on a specific (and often fringe) topic. Topics range from diet and fitness, to economy and politics. Some episodes that stood out for me recently are the ones with Colin Moriarty and Nina Teicholz.

2. Conversations with Tyler

The same concept: a one-on-one conversation covering a wide range of topics. However, this one is more formal and often a bit (too?) academic. This podcast on Macroeconomics, Mentorship and Avoiding Complacency might give you a good idea of what to expect.

3. Dan Carlin's History X

Extremely thoroughly researched lectures on important periods of our history. The Destroyer of Worlds is a 6-hour-long but captivating piece on the history of nuclear warfare that filled in a lot of gaps for me.

Not sure what to watch next. Any recommendations?