Throwing dice

September 7, 2018

Last week, I implemented a new feature at work which got deployed to integration. It was some kind of cache cleaning which could be triggered manually. To work correctly, it had to be executed on every single node - and in production, we currently have 6 nodes.

The problem is, as an external developer, I only have access to the load balancer, and not to the nodes themselves. To test it - we had some problems with authentication on the development system - I would have to make sure that every node could be reached.

I wondered: If the load balancer would route me to each node with an equal probability of 1/6th, how many times would I have to try to get to all nodes? 10 times? 15 times? And how sure could I be that all nodes would have been reached?

Throwing dice

I googled around and found out that my problem's essentially the same as throwing a dice and getting each number at least one time. It's called the Coupon collector's problem - how long does it take to collect a certain number of items - and I found out quickly some calculations about the expected value: It's 14.7.

But that wasn't enough for me, I wondered, what if I want to be at least 95% sure to get to all nodes? Calculating the exact probability density function apparently is a lot harder than just calculating the expected value. But hey, who needs complex math if a 30 line python script and some computing power can produce a nice approximation? :-)

This is the empirical probability density function for 10**8 (a hundred million) tries:

Code: Gist

Some interesting observations here:

  • The best case was 6 throws (obviously, if we got lucky), which is pretty rare (only about 1.54% of the cases), but the worst case was a whopping 126 tries, once out of a hundred million. Bad luck!
  • The graph has positive skew, which means mode, median and mean are note the same.
  • The mode is at 11 throws, with 8 441 391 (8.44%) results it's the most common value.

So based on the experimental data, the percentiles I searched for are the following:

  • 50% → 13 throws
  • 90% → 23 throws
  • 95% → 27 throws
  • 99% → 36 throws

So to be 95% sure to get to every node, I would have to fire a total of 27 requests at the load balancer.

Thanks for reading!