I'm currently working on a thesis project: designing a cache implementation to be used with a shortest-path graph algorithm. The graph algorithm's runtimes are rather inconsistent, so it's too troublesome to benchmark the entire algorithm; I need to concentrate on benchmarking only the cache.
The caches I need to benchmark are a dozen or so implementations of a Map interface. These caches are designed to work well with a given access pattern (the order in which keys are queried by the algorithm above). However, a single run of even a "small" problem issues a few hundred billion queries, and I need to run nearly all of them to be confident in the benchmark results.
I'm having conceptual problems with loading the data into memory. It's possible to create a query log: an on-disk ordered list of all the keys (they're 10-character string identifiers) that were queried in one run of the algorithm. That file is huge. My other thought was to break the log into chunks of 1-5 million queries and benchmark each chunk in the following manner (a rough sketch of this loop appears after the list):
- Load 1-5 million keys
- Set start time to current time
- Query them in order
- Record elapsed time (current time - start time)
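In code, I imagine the harness looking roughly like this. The file name, chunk size, and the `HashMap` stand-in are all placeholders of mine; a real run would swap in each cache implementation under test:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class ChunkedCacheBenchmark {
    static final int CHUNK_SIZE = 1_000_000;       // somewhere in the 1-5 million range

    public static void main(String[] args) throws IOException {
        Map<String, Long> cache = new HashMap<>(); // stand-in for the cache under test
        String[] chunk = new String[CHUNK_SIZE];
        long totalNanos = 0;
        long sink = 0;                             // consume results so the JIT can't elide the queries

        try (BufferedReader log = Files.newBufferedReader(Paths.get("query.log"))) {
            int n;
            while ((n = fillChunk(log, chunk)) > 0) {
                // All disk I/O happens here, before the clock starts,
                // so loading never counts against the cache.
                long start = System.nanoTime();
                for (int i = 0; i < n; i++) {
                    Long v = cache.get(chunk[i]);
                    sink += (v == null) ? 0 : v;
                }
                totalNanos += System.nanoTime() - start;
            }
        }
        System.out.printf("query time: %.3f s (sink=%d)%n", totalNanos / 1e9, sink);
    }

    // Read up to chunk.length keys, one per line, into chunk; return how many were read.
    static int fillChunk(BufferedReader log, String[] chunk) throws IOException {
        int i = 0;
        String line;
        while (i < chunk.length && (line = log.readLine()) != null) {
            chunk[i++] = line;
        }
        return i;
    }
}
```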
I'm unsure what effects this will have on caching. How could I perform a warm-up period? Loading the next chunk from disk will likely evict whatever the previous chunk left in the L1 or L2 caches. Also, what effect does maintaining a 1-5 million element string array have, and does merely iterating over it skew the results?
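To quantify the iteration overhead itself, I suppose I could time the same loop against a do-nothing Map and subtract that from each measurement. Something like this hypothetical stand-in (`NoOpCache` is my own name, not anything standard):

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.Set;

// Hypothetical baseline "cache": get() does no work, so timing the chunk
// loop against it measures only the harness (array iteration, loop, sink).
final class NoOpCache extends AbstractMap<String, Long> {
    @Override
    public Long get(Object key) {
        return null;                  // no lookup work at all
    }

    @Override
    public Set<Map.Entry<String, Long>> entrySet() {
        return Set.of();              // required by AbstractMap; unused here
    }
}
```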
Keep in mind that the access pattern is important! For example, some of the hash tables use a move-to-front heuristic that reorders the table's internal structure, so it would be incorrect to run a single chunk multiple times or to run chunks out of order. This makes warming the CPU caches and HotSpot a bit more difficult. (I could also keep a secondary dummy cache that's used for warming but not timing; a sketch of that idea is below.)
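Here's a minimal sketch of that dummy-cache idea, assuming `dummy` is a separately constructed instance of the same Map implementation, populated the same way as the real one. Replaying the chunk against it exercises the same `get()` code path and pulls the chunk array into the CPU caches without perturbing the state of the cache being timed:

```java
// Warm-up pass: same chunk, same code path, but against a throwaway cache,
// so the measured cache's move-to-front state is left untouched.
static long warmUp(Map<String, Long> dummy, String[] chunk, int n) {
    long discard = 0;
    for (int i = 0; i < n; i++) {
        Long v = dummy.get(chunk[i]);   // triggers JIT compilation of get(),
        discard += (v == null) ? 0 : v; // warms branch predictors, and faults
    }                                   // chunk[] into L1/L2
    return discard;                     // returned so HotSpot can't elide the loop
}
```

One caveat I can see with this: the dummy shares the lookup code but not the measured cache's memory, so it warms the instruction cache and the chunk array more effectively than the data actually being queried.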
What are good practices for microbenchmarks with a giant dataset?