Let's break this down a bit. There are two main behaviors you are looking to test:
1. The Twitter HyperLogLog implementation performs correctly, i.e. it gives a good estimate of the number of items.
2. Your code consuming HyperLogLog structures (e.g. counters) increments them when appropriate.
Note that behavior #2 is easy to test at build time using unit tests, rather than with integration tests. This is preferable and will catch most issues.
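As a minimal sketch of such a test (in Scala, since Twitter's implementation lives in Algebird): have your code depend on a small counter interface, and substitute an exact fake in the test, so estimation error can't mask a missing increment. The `DistinctCounter` trait and `recordVisit` below are hypothetical stand-ins for your own code, not Algebird's API:

```scala
import scala.collection.mutable

// Hypothetical interface your consuming code might depend on -
// the names here are assumptions, not part of any library.
trait DistinctCounter {
  def add(item: String): Unit
  def estimate: Long
}

// An exact fake: counting precisely lets the test check *your* logic
// without worrying about estimation error.
class ExactCounter extends DistinctCounter {
  private val seen = mutable.Set.empty[String]
  def add(item: String): Unit = seen += item
  def estimate: Long = seen.size.toLong
}

// Illustrative unit under test: records one visit per event.
def recordVisit(counter: DistinctCounter, userId: String): Unit =
  counter.add(userId)

// The test: the counter is incremented exactly when appropriate.
val counter = new ExactCounter
Seq("alice", "bob", "alice").foreach(recordVisit(counter, _))
assert(counter.estimate == 2L, s"expected 2 distinct users, got ${counter.estimate}")
```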
Behavior #1 can itself be broken down into three cases:
A. the number of items is 0;
B. the number of items is small (5, 100, or 1000);
C. the number of items is large (millions/billions).
Again, cases A and B can and should be tested at build time using unit tests. Decide on acceptable error bounds depending on your application and have the unit tests assert that the estimate falls within those bounds. It doesn't really matter that you chose HyperLogLog as the underlying estimation method; the tests should treat the estimator as a black box. As a ballpark, I'd say 10% error is reasonable for most purposes, but this is really up to your particular application: the bounds should represent the worst accuracy your application can live with. For example, a counter for critical errors might not be able to tolerate ANY estimation error at all, so using HyperLogLog for it should break the unit test, while a counter for the number of distinct users might be able to live with as much as 50% error - it's up to you.
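A sketch of such a black-box test, assuming Algebird's `HyperLogLogMonoid` API (items are converted to bytes via the implicits in `HyperLogLog._`, sketches combine with `sum`, and `sizeOf` reads out the estimate). The 10% bound is just the ballpark figure from above, not a recommendation:

```scala
import com.twitter.algebird.HyperLogLogMonoid
import com.twitter.algebird.HyperLogLog._ // implicit Int => Array[Byte]

// 12 bits => 2^12 registers, a common size/accuracy trade-off.
val hll = new HyperLogLogMonoid(12)

// Cases A and B: zero and small, known cardinalities.
for (n <- Seq(0, 5, 100, 1000)) {
  val sketch = hll.sum((0 until n).map(hll(_)))
  val estimate = hll.sizeOf(sketch).estimate
  // Application-chosen bound (10% here); nothing in this test depends
  // on HyperLogLog specifically, so the estimator stays swappable.
  val bound = math.max(1L, (n * 0.10).toLong)
  assert(math.abs(estimate - n) <= bound,
    s"estimate $estimate for $n items is outside the ±$bound bound")
}
```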
So that leaves us with the last case: testing that the HyperLogLog implementation gives a good estimate for a high number of items. This is not possible to test at build time, and an integration test is indeed the way to go. However, depending on how much you trust Twitter's HyperLogLog implementation, you might consider just NOT TESTING this altogether - Twitter should have done that already. This might seem like a break from best practices, but given the overhead an integration test like this carries, it may well be worth it in your case.
If you DO choose to write an integration test, you will need to model the traffic you expect in production and generate it from multiple sources, since you will be producing millions/billions of requests. You can capture a sample of real production traffic and replay it for testing (probably the most accurate method), or work out what your traffic looks like and generate similar-looking synthetic traffic. Again, the error bounds should be chosen according to the application, and you should be able to swap the estimation method for a better one without breaking the test.
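A sketch of the driver for such a test, again assuming Algebird's monoid API. The shard layout and the 5M-id scale are placeholders for your real traffic model; replaying sampled production traffic would slot in where the synthetic ids are generated:

```scala
import com.twitter.algebird.HyperLogLogMonoid
import com.twitter.algebird.HyperLogLog._ // implicit Int => Array[Byte]

val hll = new HyperLogLogMonoid(12)
val totalIds = 5000000 // placeholder scale - model this on your production traffic
val sources = 8        // simulate traffic arriving from several generators

// Each "source" sketches its own disjoint shard of the id space...
val partials = (0 until sources).map { s =>
  hll.sum((s until totalIds by sources).iterator.map(hll(_)))
}
// ...and the partial sketches merge losslessly, because HLL is a monoid.
val merged = hll.sum(partials)

val estimate = hll.sizeOf(merged).estimate
val bound = (totalIds * 0.10).toLong // same application-chosen bound as the unit tests
assert(math.abs(estimate - totalIds) <= bound,
  s"estimate $estimate for $totalIds distinct ids is outside ±$bound")
```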