
I have a database table with over 5 million static records in place. Simple structure: (starting int, ending int, result int). So I have a certain INT and I need to find its corresponding result (int). Currently, the look-up table is in a DB, but it needs to reside in memory, most likely in an environment without database access.

My solution needs to perform this logic without database access, in memory, and super fast, as I need to process 1000s of transactions per second. The size of the set is a little over 50MB, so I could throw the whole thing into memory and run range look-ups against it, per this post: Doing a range lookup in C# - how to implement. But I don't know how it will perform at that scale.
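To be concrete, here is roughly the shape of the look-up I have in mind, adapted from that post (a sketch only; the names are mine):

// needs using System.Collections.Generic;
class Range
{
    public int Start, End, Result;
    public Range(int start, int end, int result) { Start = start; End = end; Result = result; }
    public Range(int probe) : this(probe, probe, 0) { } // single-value probe used for searching
}

class RangeComparer : IComparer<Range>
{
    // Orders disjoint ranges by position; a probe that falls inside a range compares equal to it.
    public int Compare(Range x, Range y)
    {
        if (x.End < y.Start) return -1;
        if (x.Start > y.End) return 1;
        return 0;
    }
}

// Ranges is a List<Range> sorted by Start; the look-up itself:
int index = Ranges.BinarySearch(new Range(someInt), new RangeComparer());
int result = index >= 0 ? Ranges[index].Result : -1; // -1 meaning "not found"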

  • Do I pre-load that table "on startup"? It might take a while.
  • Any way to load the table into some .dat file and have a super-efficient look-up at run time (something like the sketch below)?
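For example, this is roughly what I mean by a .dat file (again just a sketch; the 12-byte record layout is my assumption):

// Sketch: write the table once as a flat binary file of (start, end, result) triples...
static void SaveRanges(List<Range> ranges, string path)
{
    using (var w = new BinaryWriter(File.Create(path)))
        foreach (var r in ranges)
        {
            w.Write(r.Start);
            w.Write(r.End);
            w.Write(r.Result);
        }
}

// ...then read it straight back at startup (needs System.IO).
static List<Range> LoadRanges(string path)
{
    using (var rdr = new BinaryReader(File.OpenRead(path)))
    {
        int count = (int)(rdr.BaseStream.Length / 12); // 3 ints = 12 bytes per record
        var ranges = new List<Range>(count);
        for (int i = 0; i < count; i++)
            ranges.Add(new Range(rdr.ReadInt32(), rdr.ReadInt32(), rdr.ReadInt32()));
        return ranges;
    }
}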

BTW, I am on Azure; not sure if using Storage Tables would help with the lookups...

enlightenedOne
  • "I have a database table" .. "But my solution needs to perform this logic without database access" - so how does that data get into memory? You read the database, right? – Mitch Wheat Mar 07 '13 at 03:25
  • Loading it all into memory is going to be the fastest. It all depends on whether the data is changing; then it gets a bit trickier to manage the saves back to the database, but if it's not changing, then just cache it in memory. – Keith Nicholas Mar 07 '13 at 03:27
  • @Keith: not necessarily. A database has indexes... – Mitch Wheat Mar 07 '13 at 03:28
  • @MitchWheat - sorry, should've been more clear. Currently, the look-up table is in a DB, but it needs to reside in memory, most likely in an environment without database access. – enlightenedOne Mar 07 '13 at 03:29
  • That still makes little sense... To get into memory in the first instance you have to read it from somewhere. – Mitch Wheat Mar 07 '13 at 03:29
  • @MitchWheat sure, maybe I assumed too much, but I'd figure you'd use something in memory appropriate to what you want to lookup – Keith Nicholas Mar 07 '13 at 03:31
  • @enlightenedOne Mitch is being pedantic; you've implied you don't want any database access at all, rather than saying you don't want the DB to do the range lookup but are still happy to use it to load the data into memory :) – Keith Nicholas Mar 07 '13 at 03:32
  • @MitchWheat like the link in the question? – Keith Nicholas Mar 07 '13 at 03:34
  • @MitchWheat but does the job he wants? :) – Keith Nicholas Mar 07 '13 at 03:35

3 Answers


Binary searches are super fast. A binary search on even 50M records takes only 27 comparisons to find the answer. Just load the data into memory and use the Range lookup you linked.

If you find it is slow, start optimizing:

  • Change the Range type from a class to a struct.
  • Hand-code your own binary search that (a) inlines the comparison instead of calling out through an IComparer<T> and (b) uses pointers and other unsafe tricks to avoid array bounds checking during the search (sketched below).
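A rough sketch of both suggestions combined (assuming a Range struct of three ints, sorted by Start; compile with /unsafe — this is an illustration, not a tuned implementation):

struct Range
{
    public int Start, End, Result;
}

static unsafe int FindResult(Range[] ranges, int value)
{
    fixed (Range* p = ranges) // pin the array; pointer access skips bounds checks
    {
        int lo = 0, hi = ranges.Length - 1;
        while (lo <= hi)
        {
            int mid = lo + ((hi - lo) >> 1);
            Range* r = p + mid;
            if (r->End < value) lo = mid + 1;        // probe is above this range
            else if (r->Start > value) hi = mid - 1; // probe is below this range
            else return r->Result;                   // probe falls inside this range
        }
    }
    return -1; // sentinel: no range contains value
}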
Brandon
  • You'd avoid having to follow a pointer to the Range objects on the heap during your search. It is an admittedly very minor perf. improvement. – Brandon Mar 07 '13 at 03:35
  • Ummmm ... 2^17 is only 128K. You'd need 27 probes for 50 million. – Jim Mischel Mar 07 '13 at 03:37
  • The array of Range structs will still be stored in the heap. http://stackoverflow.com/questions/1113819/arrays-heap-and-stack-and-value-types – Eric J. Mar 07 '13 at 03:46
  • Sure. One pointer to access the array... and then you have your struct. As ref objects, it would be one pointer to access the array... which gives you a second pointer to access the ref object at that location. – Brandon Mar 07 '13 at 03:48
  • Not to mention, one heap allocation of 50MB will put less pressure on the GC than 50M small objects. – Brandon Mar 07 '13 at 03:50
  • Brandon, how do you recommend I load that file into Range[] in a fast fashion? The file is just one line per range (from,to,result). Thanks! – enlightenedOne Mar 07 '13 at 15:52
  • That seems worthy of a new SO question :) – Brandon Mar 07 '13 at 16:06
  • Here it goes... http://stackoverflow.com/questions/15276164/binary-search-how-to-load-5m-records-from-a-file-into-rangeint-array – enlightenedOne Mar 07 '13 at 16:15

The range lookup code you link to performs a binary search, so performance will be O(log n). I don't think you can do better than that for a range lookup. A HashSet<T>'s lookup is O(1), but you cannot use that structure for a range lookup.

5 million records is not really a huge amount. I suggest you implement a proof of concept with the code you link to on the hardware you will use in production, and measure performance.
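For the one-time load in such a proof of concept, a minimal ADO.NET sketch (the table and column names here are assumptions):

// Sketch: one-time load of the lookup table (needs System.Data.SqlClient).
// "RangeTable", "StartInt", "EndInt", "ResultInt" are assumed names.
static List<Range> LoadFromDb(string connectionString)
{
    var ranges = new List<Range>(5000000); // pre-size to avoid re-allocations
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(
        "SELECT StartInt, EndInt, ResultInt FROM RangeTable ORDER BY StartInt", conn))
    {
        conn.Open();
        using (var reader = cmd.ExecuteReader())
            while (reader.Read())
                ranges.Add(new Range(reader.GetInt32(0), reader.GetInt32(1), reader.GetInt32(2)));
    }
    return ranges; // already sorted by Start, ready for binary search
}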

Eric J.
  • What about an in-memory database? Would it search faster than a binary search, or what search algorithm would it use? – TalentTuner Mar 07 '13 at 03:49
  • @Saurabh: Any in-memory database presents an abstraction layer on top of the data to handle query semantics. There will always be some performance loss compared to a data structure designed to handle one very specific query well. If performance is paramount, the OP's binary search approach will be faster than an in-memory, general purpose DB. – Eric J. Mar 07 '13 at 03:51
  • Do you know how big the difference is in terms of Big-O? – TalentTuner Mar 07 '13 at 03:54
  • Thanks Eric, how do you recommend I load that file into Range[] in a fast fashion? – enlightenedOne Mar 07 '13 at 15:43
  • @enlightenedOne: You can use ADO.Net to just select the data from the table and create one Range per row. – Eric J. Mar 08 '13 at 05:51
  • @Saurabh: They probably have the same Big-O performance (DB's will use a similar algorithm), but absolute performance will almost certainly be better with the custom data structure because of a (probably) O(1) overhead for the query semantics. O(log n) + O(1) is approximately equal to O(log n) for large n, but in reality n is probably small enough that the O(1) component actually matters. – Eric J. Mar 08 '13 at 05:53

You could certainly put that in a sequential file and load it at startup. 50 MB will come off the disk in less than a second. And even if you had to parse it as a text file, you should be able to create the table in another second. 5 million records just isn't that large when you're processing them with a 2 GHz (or faster) processor.
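A minimal sketch of that parse, assuming the one-line-per-range (from,to,result) text format mentioned in the comments:

// Sketch: parse "from,to,result" lines into the list used for searching (needs System.IO).
static List<Range> LoadFromTextFile(string path)
{
    string[] lines = File.ReadAllLines(path); // one sequential read off the disk
    var ranges = new List<Range>(lines.Length);
    foreach (string line in lines)
    {
        string[] parts = line.Split(',');
        ranges.Add(new Range(int.Parse(parts[0]), int.Parse(parts[1]), int.Parse(parts[2])));
    }
    return ranges;
}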

Binary search of the list is O(log n), so the maximum number of probes you'll do per search is 23 (2^23 is over 8 million). That's gonna be pretty darned quick.

It should be easy enough to load test something like this. Just spin it up and then see how long it takes to do, say, 1,000,000 lookups. Something like:

// Assumes Ranges is a List<Range> sorted by Start and RangeComparer is an
// IComparer<Range>, as sketched in the question. Needs System.Diagnostics.
var clock = Stopwatch.StartNew();
for (int i = 0; i < NumIterations; ++i)
{
    int val = GetRandomValueToSearchFor(); // however you do that
    Ranges.BinarySearch(new Range(val), RangeComparer);
}
clock.Stop();
// time per iteration is clock.Elapsed.TotalMilliseconds / NumIterations

That'll let you figure out the absolute fastest you can query the thing. I suspect you'll be fine with thousands of transactions per second.

Jim Mischel