
This is the scenario:

  1. We have a large data model with > 2,000,000 data points.
  2. Each datum includes: a measurement (int or float), a timestamp, a quality enum (good, bad, unknown) and a unique id.
  3. Measurement updates come in via Kafka and approximately 20% of the measurements are updated each second.
  4. Every few seconds, an algorithm runs against the current measurement model. At this point the in-memory snapshot is locked.
  5. When the algorithm is complete, the measurement model is updated with the changes waiting on the Kafka bus.

What is the best data structure (for performance) to use for the in-memory snapshot of the measurement model?

Thanks

user3029642
  • How are you accessing the elements? By the ID? – eesiraed Jul 10 '18 at 00:08
  • By the ID, yes. – user3029642 Jul 10 '18 at 00:11
  • You would probably use a binary search tree or a hash table, implemented in [`std::map`](https://en.cppreference.com/w/cpp/container/map) and [`std::unordered_map`](https://en.cppreference.com/w/cpp/container/unordered_map). – eesiraed Jul 10 '18 at 00:13
  • 2
    Is the dataset fixed in length or can it grow/shrink? - Do you need or have a flag for int vs float? - What type is the ID, and are the ID's contiguous (or can you still change the type)? - How long is the data locked (and can there be multiple pending updates for the same ID in that time)? - In what form does the data arrive (json, binary)? – Danny_ds Jul 10 '18 at 01:35
  • And what is the resolution of the timestamp? – Danny_ds Jul 10 '18 at 02:20
  • The dataset is fixed. No flag for int vs float. The two measurement types will have separate topics. – user3029642 Jul 10 '18 at 04:25
  • The data will be locked for about 100ms and there could be multiple pending updates for the same id. The ids are not contiguous and they cannot be changed. – user3029642 Jul 10 '18 at 04:26
  • The form of the incoming data is still up in the air. – user3029642 Jul 10 '18 at 04:28
  • The resolution of the timestamp will be to the millisecond. – user3029642 Jul 10 '18 at 04:30
  • If you make the guid a string, you can model the below answer using Avro, which works well with Kafka when you use a Schema Registry – OneCricketeer Jul 10 '18 at 13:52

1 Answer


So each of your datums is (basically) the following struct:

struct datum {
    unsigned char guid[16];
    enum { Int, Float } measurement_kind;
    union {
        int i;
        float f;
    } measurement;
    time_t timestamp;
    enum { Good, Bad, Unknown } quality;
};

That's 40 bytes per datum on a typical 64-bit ABI (padding included). With 2 million of these, the total comes to roughly 80 megabytes. Even if your data structure has 4x overhead, that's not exactly "big" data; some Xeon CPUs can almost fit that in their L3 cache.


At a minimum, you need a data structure with fast lookups by ID, so a hash table (std::unordered_map) is the obvious choice. But there are a few things you might be able to exploit to roll your own hash table implementation that could outperform it.

  1. If your IDs are contiguous (rather than GUIDs, as I'm assuming), then you can use a plain array rather than a hash table, which has the clear advantage of not requiring a hash function: just use the IDs as indices.
  2. If you have a fixed (or bounded) number of data points, you can store the actual data points in contiguous memory. Using an open-addressing hash table (unlike std::unordered_map) with a fixed load factor might also be faster. Test storing both pointers to the elements and the elements themselves in the table.
  3. If you can take ownership of Kafka's results, then copying pointers rather than the full structures might be better. Memory fragmentation might make this slow, but it also might not.
  4. If you know that certain measurements get "hot" (i.e. are updated frequently), then reordering them in the contiguous store and in the hash table chains could improve your cache locality.
  5. If you know that during an update you won't change the hashtable, then you can partition the updates and parallelize them trivially, without locks.

In all cases, you should test these potential improvements, if they apply, against the standard library implementation. It is impossible to give a certain answer without measuring performance.

Alex Reinking