Wednesday, November 18, 2015

Slides from my talk "Evil by Design" at Build Stuff

Third time attending Build Stuff, first time doing a talk. I'm happy that it's out of the way and that I can now just enjoy the conference, but I'm even more excited that it was well-received! The talk should have been recorded; until the video shows up, you can already find the abstract and slides below.
Last year I ventured into the domain of (online) gambling. Given that the industry has been around since forever, I expected most problems to be of the technical kind. As it turned out, the struggle with technology was only part of a bigger problem; to move forward we needed to fully grasp the industry and its consumers. 
Events started out as a way to dismantle a legacy system, but quickly proved to be an effective tool to gain a deeper understanding of our domain. Visualising event streams, we discovered patterns that helped us identify what drives different types of users. 
Having a better understanding of what customers are looking for, we dove into existing literature to learn which techniques and models casinos use to cater for each type of user. We learned how to program chance while staying true to the Random Number God. Even when variance is brutal, casinos have enough data and tools to steer clear of the pain barrier. 
All of this entails interesting problems and software, but isn't my code damaging society? Or is gambling just another human trait?

Monday, November 16, 2015

Defining big wins

Casinos invest a lot of energy selling the dream. One way to do this is by showing off people winning big in your casino. Everyone has seen those corny pictures of people holding human-sized cheques, right? It's a solid tactic: empirical evidence shows that after a store has sold a large-prize winning lottery ticket, ticket sales at that store increase by 12 to 38% over the following weeks.

If we look at slot machine play, what exactly defines a big win? The first stab we took at this was quite sloppy: we took an arbitrary number and said wins bigger than 500 euro are impressive. This was quick and easy to implement, but when we looked at the results we noticed that for players betting at high stakes, a win of 500 euro really isn't that impressive, and the exceptional high roller would often dominate the results.

What defines a big win is not the amount, but how many times the win multiplies your stake. Betting 1 euro to win 200 euro sounds like quite the return, right? Having come to this conclusion, we had to define a multiplier threshold that indicates a big win.

Since each win correlates to a bet, we could project the multipliers and look at their distribution.
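Computing those multipliers is trivial; a small F# sketch of that projection, with made-up record fields:

```fsharp
// A settled bet, as it could come out of our data; the field names are illustrative.
type SettledBet = { Stake : decimal; Win : decimal }

// The multiplier expresses how many times the win multiplies the stake.
let multiplier bet = float (bet.Win / bet.Stake)

// Betting 1 euro to win 200 euro gives a multiplier of 200.
let example = multiplier { Stake = 1.0M; Win = 200.0M }
```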

For the exploration itself I'm using Matlab in this example, but we could do the same in Excel or in code.

So first we load the multipliers data set.

Then we look at its histogram, visualizing how the multipliers are distributed.

Here we notice a skew towards large values; a few points are much larger than the bulk of the data. A logarithmic scale can help us here.

This shows a reasonably good fit to a bell curve, meaning the multipliers are roughly log-normally distributed. We could now use the standard deviation of the log values to pick out the outliers.

But we can also tabulate the data set and hand-pick the cut-off for normal wins.

We can now write a rule in our big win projection which states that a log(multiplier) larger than 3 is considered a big win.
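The same exploration can also be scripted outside Matlab. A rough F# sketch, assuming the multipliers sit one per line in a text file and using the natural logarithm, just like the Matlab example:

```fsharp
open System
open System.IO

// Load the multipliers data set; one multiplier per line.
let multipliers =
    File.ReadLines "multipliers.txt"
    |> Seq.map float
    |> Seq.toList

// Work on a logarithmic scale to tame the skew towards large values.
let logMultipliers = multipliers |> List.map log

// Mean and standard deviation of the log values.
let mean = List.average logMultipliers
let stdDev =
    logMultipliers
    |> List.averageBy (fun x -> (x - mean) ** 2.0)
    |> sqrt

// A crude text histogram, bucketing the log values per whole unit.
logMultipliers
|> List.countBy floor
|> List.sortBy fst
|> List.iter (fun (bucket, count) ->
    printfn "%3.0f | %s" bucket (String.replicate count "#"))

// The hand-picked rule: a log(multiplier) larger than 3 counts as a big win.
let isBigWin multiplier = log multiplier > 3.0

let bigWins = multipliers |> List.filter isBigWin
printfn "%i big wins out of %i wins" bigWins.Length multipliers.Length
```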

Matlab, Excel and the like are great domain-specific tools for data exploration that can help you get a better feel for and understanding of your data.

Sunday, October 18, 2015

Bulk SQL projections with F# and type providers

In early summer, I had to set up an integration with an external partner. They required us to provide them daily with a relational dataset stored in SQL Server. Most, if not all, of the data was temporal and append-only by nature; think logins, financial transactions and the like.

Since the required data largely lived in an eventstore on our end, I needed fast bulk projections. After experimenting with a few approaches, I eventually settled on projections in F#, taking advantage of type providers.

Let's say we have an event for when users watched a video and one for when users shared a video.
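As a sketch, those two events could be modelled as a simple discriminated union (the field names are just illustrative):

```fsharp
open System

// The two events as they could be read from the eventstore.
type Event =
    | WatchedVideo of userId : string * videoId : string * watchedAt : DateTime
    | SharedVideo of userId : string * videoId : string * sharedAt : DateTime
```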

We want to take streams from our eventstore and project them to a specific state; a stream goes in and state comes out.
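In other words, a projection is just a function from a stream of events to some state. A minimal sketch, reusing the Event type above:

```fsharp
// A stream goes in, state comes out.
type Projection<'State> = Event list -> 'State

// A trivial example: counting how many videos were watched.
let countWatchedVideos : Projection<int> =
    fun stream ->
        stream
        |> List.filter (function WatchedVideo _ -> true | _ -> false)
        |> List.length
```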

Then we want to take that state, and store it in our SQL Server database.
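The shape of that storing step can be captured in a type as well; here I'm assuming the state gets written inside an open SqlTransaction, which the infrastructure below will provide:

```fsharp
open System.Data.SqlClient

// Given an open transaction and the projected state, persist the state.
type Store<'State> = SqlTransaction -> 'State -> unit
```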

Some infrastructure that reads a specific stream, runs the projection, stores the state and checkpoints the projection could look like this.
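A sketch of what that infrastructure could look like; readStream, getCheckpoint and writeCheckpoint are assumed helpers against the eventstore and a checkpoints table, and are not shown here:

```fsharp
open System.Data.SqlClient

// Assumed helpers:
//   readStream      : streamName -> fromCheckpoint -> events after that checkpoint, plus the new checkpoint
//   getCheckpoint   : streamName -> last stored checkpoint
//   writeCheckpoint : transaction -> streamName -> checkpoint -> unit
let runProjection (connectionString : string)
                  (readStream : string -> int -> Event list * int)
                  (getCheckpoint : string -> int)
                  (writeCheckpoint : SqlTransaction -> string -> int -> unit)
                  (streamName : string)
                  (projection : Projection<'State>)
                  (store : Store<'State>) =
    // Read every event after the last checkpoint and fold it into state.
    let events, newCheckpoint = readStream streamName (getCheckpoint streamName)
    let state = projection events
    // Store the state and write the checkpoint in the same transaction.
    use connection = new SqlConnection(connectionString)
    connection.Open()
    use transaction = connection.BeginTransaction()
    store transaction state
    writeCheckpoint transaction streamName newCheckpoint
    transaction.Commit()
```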

To avoid data corruption, storing the state and writing the checkpoint happen in the same transaction.

With this piece of infrastructure in place, we are close to implementing an example. But before we do that, we first need to install the FSharp.Data.SqlClient package. This package's SqlProgrammabilityProvider type provider provides us with types for each table in our destination database. In the snippet below, I'll create a typed dataset for the WatchedVideos table and add a row.
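A sketch of what that snippet could look like; the connection string, database name and the WatchedVideos columns (UserId, VideoId, WatchedAt) are assumptions about the destination database:

```fsharp
open System
open FSharp.Data

// The type provider needs a compile-time literal connection string.
[<Literal>]
let ConnectionString =
    @"Data Source=.;Initial Catalog=ExternalReporting;Integrated Security=True"

type Destination = SqlProgrammabilityProvider<ConnectionString>

// A typed, in-memory dataset for the dbo.WatchedVideos table.
let watchedVideos = new Destination.dbo.Tables.WatchedVideos()

// Columns are assumed to be UserId, VideoId and WatchedAt.
watchedVideos.AddRow("user-1", "video-42", DateTime.UtcNow)
```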

I haven't defined this type, nor was it generated by me. The SqlProgrammabilityProvider type provider gives you these types for free, based on the metadata it extracts from the destination database. This also means that when you change a table without changing your code, the compiler will have no mercy and will immediately point out where you broke your code. In this use case, where you would rather rebuild your data than migrate it, the feedback loop of changing your database model becomes so short that it allows you to break stuff with great confidence. The only caveat is that the compiler must always be able to access that specific database, or compilation fails. In practice, this means you need to ship your source with a build script that sets up your database locally before you do any work.

Going from a stream to a dataset is quite declarative and straightforward with the help of pattern matching.
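A sketch of such a projection, folding a stream into the typed dataset; it reuses the Event and Projection types and the Destination types from above:

```fsharp
// Fold the stream into a typed WatchedVideos dataset, ignoring the events
// we're not interested in.
let projectWatchedVideos : Projection<Destination.dbo.Tables.WatchedVideos> =
    fun stream ->
        let table = new Destination.dbo.Tables.WatchedVideos()
        stream
        |> List.iter (fun event ->
            match event with
            | WatchedVideo (userId, videoId, watchedAt) ->
                table.AddRow(userId, videoId, watchedAt)
            | SharedVideo _ -> ())
        table
```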

Storing the result in an efficient fashion is also simple, since the dataset directly exposes a BulkCopy method.
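Storing could then look something like this; I'm assuming the BulkCopy overload that takes a connection, copy options and a transaction, so the write stays inside the transaction opened by the infrastructure:

```fsharp
open System.Data.SqlClient

// Bulk insert the dataset, reusing the connection and transaction from the infrastructure.
let storeWatchedVideos : Store<Destination.dbo.Tables.WatchedVideos> =
    fun transaction table ->
        table.BulkCopy(transaction.Connection, SqlBulkCopyOptions.Default, transaction)
```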

When we put this all together, we end up with this composition.
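Putting the pieces together could then look something like this; the eventstore and checkpoint helpers are stubbed out here, since they fall outside the scope of this post:

```fsharp
[<EntryPoint>]
let main _ =
    // Stubbed eventstore and checkpoint helpers, for the sake of the example.
    let readStream (_ : string) (fromCheckpoint : int) : Event list * int =
        [ WatchedVideo ("user-1", "video-42", DateTime.UtcNow) ], fromCheckpoint + 1
    let getCheckpoint (_ : string) = 0
    let writeCheckpoint (_ : SqlTransaction) (_ : string) (_ : int) = ()

    runProjection
        ConnectionString
        readStream getCheckpoint writeCheckpoint
        "videos"
        projectWatchedVideos
        storeWatchedVideos
    0
```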

Executing this program, we can see the data was persisted as expected.

In the real world, you also want to take care of batching and logging, but that isn't too hard to implement.

Having had this approach in production for some time now, I'm still quite happy with how it turned out. The implementation is fast, and the code is compact and easy to maintain.