Let's break this down a bit. There are two main behaviors you are looking to test:
1. The Twitter HyperLogLog implementation performs correctly, i.e. it gives a good estimate of the number of items.
2. Your code consuming HyperLogLog structures (e.g. counters) increments them when appropriate.
Note that behavior #2 is easy to test at build time using unit tests, rather than with integration tests. This is preferable and will catch most issues.
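As a minimal sketch of such a test (in Scala, since Twitter's implementation lives in Algebird): have your code depend on a small counter interface, and substitute an exact fake in the test, so estimation error can't mask a missing increment. The `DistinctCounter` trait and `recordVisit` below are hypothetical stand-ins for your own code, not Algebird's API:

```scala
import scala.collection.mutable

// Hypothetical interface your consuming code might depend on -
// the names here are assumptions, not part of any library.
trait DistinctCounter {
  def add(item: String): Unit
  def estimate: Long
}

// An exact fake: counting precisely lets the test check *your* logic
// without worrying about estimation error.
class ExactCounter extends DistinctCounter {
  private val seen = mutable.Set.empty[String]
  def add(item: String): Unit = seen += item
  def estimate: Long = seen.size.toLong
}

// Illustrative unit under test: records one visit per event.
def recordVisit(counter: DistinctCounter, userId: String): Unit =
  counter.add(userId)

// The test: the counter is incremented exactly when appropriate.
val counter = new ExactCounter
Seq("alice", "bob", "alice").foreach(recordVisit(counter, _))
assert(counter.estimate == 2L, s"expected 2 distinct users, got ${counter.estimate}")
```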
Behavior #1 can itself be broken down into three cases:
A. the number of items is 0;
B. the number of items is small (5, 100, or 1000);
C. the number of items is large (millions/billions).
Again, cases A and B can and should be tested at build time using unit tests. Decide on acceptable error bounds depending on your application and have the unit tests assert that the estimate falls within those bounds. It doesn't really matter that you chose HyperLogLog as the underlying estimation method; the tests should treat the estimator as a black box. As a ballpark, I'd say 10% error is reasonable for most purposes, but this is really up to your particular application: the bounds should represent the worst accuracy your application can live with. For example, a counter for critical errors might not be able to tolerate ANY estimation error at all, so using HyperLogLog for it should break the unit test, while a counter for the number of distinct users might be able to live with as much as 50% error - it's up to you.
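A sketch of such a black-box test, assuming Algebird's `HyperLogLogMonoid` API (items are converted to bytes via the implicits in `HyperLogLog._`, sketches combine with `sum`, and `sizeOf` reads out the estimate). The 10% bound is just the ballpark figure from above, not a recommendation:

```scala
import com.twitter.algebird.HyperLogLogMonoid
import com.twitter.algebird.HyperLogLog._ // implicit Int => Array[Byte]

// 12 bits => 2^12 registers, a common size/accuracy trade-off.
val hll = new HyperLogLogMonoid(12)

// Cases A and B: zero and small, known cardinalities.
for (n <- Seq(0, 5, 100, 1000)) {
  val sketch = hll.sum((0 until n).map(hll(_)))
  val estimate = hll.sizeOf(sketch).estimate
  // Application-chosen bound (10% here); nothing in this test depends
  // on HyperLogLog specifically, so the estimator stays swappable.
  val bound = math.max(1L, (n * 0.10).toLong)
  assert(math.abs(estimate - n) <= bound,
    s"estimate $estimate for $n items is outside the ±$bound bound")
}
```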
So that leaves us with the last case: testing that the HyperLogLog implementation gives a good estimate for a high number of items. This is not possible to test at build time, and an integration test is indeed the way to go. However, depending on how much you trust Twitter's HyperLogLog implementation, you might consider just NOT TESTING this altogether - Twitter should have done that already. This might seem like a break from best practices, but given the overhead an integration test like this carries, it may well be worth it in your case.
If you DO choose to write an integration test, you will need to model the traffic you expect in production and generate it from multiple sources, since you will be producing millions/billions of requests. You can capture a sample of real production traffic and replay it for testing (probably the most accurate method), or work out what your traffic looks like and generate similar-looking synthetic traffic. Again, the error bounds should be chosen according to the application, and you should be able to swap the estimation method for a better one without breaking the test.
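A sketch of the driver for such a test, again assuming Algebird's monoid API. The shard layout and the 5M-id scale are placeholders for your real traffic model; replaying sampled production traffic would slot in where the synthetic ids are generated:

```scala
import com.twitter.algebird.HyperLogLogMonoid
import com.twitter.algebird.HyperLogLog._ // implicit Int => Array[Byte]

val hll = new HyperLogLogMonoid(12)
val totalIds = 5000000 // placeholder scale - model this on your production traffic
val sources = 8        // simulate traffic arriving from several generators

// Each "source" sketches its own disjoint shard of the id space...
val partials = (0 until sources).map { s =>
  hll.sum((s until totalIds by sources).iterator.map(hll(_)))
}
// ...and the partial sketches merge losslessly, because HLL is a monoid.
val merged = hll.sum(partials)

val estimate = hll.sizeOf(merged).estimate
val bound = (totalIds * 0.10).toLong // same application-chosen bound as the unit tests
assert(math.abs(estimate - totalIds) <= bound,
  s"estimate $estimate for $totalIds distinct ids is outside ±$bound")
```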