So each of your datums is (basically) the following struct:
struct datum {
    unsigned char guid[16];
    enum { Int, Float } measurement_kind;
    union {
        int i;
        float f;
    } measurement;
    time_t timestamp;
    enum { Good, Bad, Unknown } quality;
};
That works out to 40 bytes per datum on a typical 64-bit platform (16 + 4 + 4 + 8 + 4 = 36 bytes of fields, padded to 40 for alignment). If you have 2 million of these, that totals about 80 megabytes. Even if your data structure has 4x overhead, that's not exactly "big" data; some Xeon CPUs can almost fit that in their L3 cache.
At a minimum, you need a data structure with fast ID lookups, so a hash table (std::unordered_map) is the obvious choice.
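As a baseline, that might look something like the sketch below. The names are mine: Guid is a std::array<unsigned char, 16> so the key can be compared and hashed, and GuidHash is just an illustrative fold of the two 8-byte halves, not a tuned hash function.

#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <ctime>
#include <unordered_map>

using Guid = std::array<unsigned char, 16>;  // std::array so the key is comparable and hashable

struct Datum {
    Guid guid;
    enum Kind { Int, Float } measurement_kind;
    union { int i; float f; } measurement;
    std::time_t timestamp;
    enum Quality { Good, Bad, Unknown } quality;
};

// Illustrative hash: GUIDs are already close to uniformly distributed,
// so folding the two 8-byte halves together is probably good enough.
struct GuidHash {
    std::size_t operator()(const Guid& g) const noexcept {
        std::uint64_t lo, hi;
        std::memcpy(&lo, g.data(), 8);
        std::memcpy(&hi, g.data() + 8, 8);
        return static_cast<std::size_t>(lo ^ (hi * 0x9e3779b97f4a7c15ULL));
    }
};

// Latest measurement per ID; lookup and overwrite are O(1) on average.
using Store = std::unordered_map<Guid, Datum, GuidHash>;

void upsert(Store& store, const Datum& d) { store[d.guid] = d; }

const Datum* find(const Store& store, const Guid& g) {
    auto it = store.find(g);
    return it == store.end() ? nullptr : &it->second;
}

The main cost of this baseline is that std::unordered_map allocates a separate node per element, so every lookup chases at least one extra pointer and the nodes end up scattered across the heap. There are a few properties of your data that you might be able to exploit to roll your own implementation that outperforms it: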
- If your IDs are contiguous integers (rather than GUIDs, which is what I've assumed above), then you can use a plain array instead of a hash table, which has the clear advantage of not needing a hash function at all: the ID is the index.
- If you have a fixed (or bounded) number of data points, you can store the actual data points in contiguous memory. An open-addressing hash table with a fixed load factor might also be faster than std::unordered_map, which allocates a separate node per element (see the sketch after this list). Test both storing pointers to the elements and storing the elements themselves in the table.
- If you can take ownership of Kafka's results, then copying pointers rather than the full structures might be better. Memory fragmentation might make this slow, but it also might not.
- If you know that certain measurements get "hot" (i.e. are updated frequently), then reordering them in the contiguous store and in the hash table chains could improve your cache locality.
- If you know that an update never changes the structure of the hash table (no insertions or removals, only overwriting existing values), then you can partition the updates by key and parallelize them trivially, without locks (see the threaded sketch after this list).
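To make the open-addressing idea concrete, here is a minimal sketch of a fixed-capacity, linear-probing table that keeps the elements themselves in one contiguous allocation. It is deliberately simplified (no deletion, no resizing, load factor fixed around 0.5) and assumes the expected element count is known and non-zero; all the names are mine.

#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <ctime>
#include <vector>

using Guid = std::array<unsigned char, 16>;

// Same shape as the datum above, repeated so this sketch stands alone.
struct Datum {
    Guid guid;
    enum Kind { Int, Float } measurement_kind;
    union { int i; float f; } measurement;
    std::time_t timestamp;
    enum Quality { Good, Bad, Unknown } quality;
};

class FlatStore {
public:
    // Twice the expected element count keeps the load factor around 0.5.
    explicit FlatStore(std::size_t expected_count)
        : slots_(2 * expected_count), used_(slots_.size(), 0) {}

    // Insert or overwrite the entry for d's GUID.
    void upsert(const Datum& d) {
        std::size_t i = probe(d.guid);
        slots_[i] = d;
        used_[i] = 1;
    }

    // Return a pointer to the stored datum, or nullptr if the GUID is absent.
    const Datum* find(const Guid& g) const {
        std::size_t i = probe(g);
        return used_[i] ? &slots_[i] : nullptr;
    }

private:
    // Linear probing: walk from the hashed slot until the key or an empty
    // slot is found. With no deletion and a load factor below 1 there is
    // always an empty slot, so this terminates.
    std::size_t probe(const Guid& g) const {
        std::size_t i = hash(g) % slots_.size();
        while (used_[i] && slots_[i].guid != g)
            i = (i + 1) % slots_.size();
        return i;
    }

    // GUIDs are already well distributed, so a simple fold of the two halves
    // is likely good enough; measure before doing anything fancier.
    static std::size_t hash(const Guid& g) {
        std::uint64_t lo, hi;
        std::memcpy(&lo, g.data(), 8);
        std::memcpy(&hi, g.data() + 8, 8);
        return static_cast<std::size_t>(lo ^ (hi * 0x9e3779b97f4a7c15ULL));
    }

    std::vector<Datum> slots_;  // the elements themselves, stored contiguously
    std::vector<char> used_;    // occupancy flags, parallel to slots_
};

Compared to std::unordered_map, a lookup here walks a contiguous array instead of chasing node pointers, which is where the potential cache-locality win comes from; only a benchmark on your data will tell you whether it actually pays off.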
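And to illustrate the lock-free partitioning from the last bullet: if a batch of updates only overwrites existing entries, you can split the work by key so that no two threads ever write to the same element. The sketch below uses the dense-ID array case for brevity, and Update and apply_updates are hypothetical names; the same partition-by-key trick works for the GUID table as long as nothing is inserted or removed during the batch.

#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical update record: a dense integer ID and the new value.
struct Update {
    std::size_t id;
    float value;
};

// Apply a batch of updates with num_threads workers and no locks.
// Assumes num_threads > 0, every id is a valid index into store, and the
// store itself is not resized while the batch is being applied.
void apply_updates(std::vector<float>& store,
                   const std::vector<Update>& updates,
                   unsigned num_threads) {
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&store, &updates, num_threads, t] {
            // Partition by ID: entries with id % num_threads == t belong to
            // this worker, so no two threads ever write the same element.
            for (const Update& u : updates) {
                if (u.id % num_threads == t)
                    store[u.id] = u.value;
            }
        });
    }
    for (std::thread& w : workers) w.join();
}

Each worker scans the whole batch and applies only its share; pre-bucketing the updates per thread would avoid the redundant scanning at the cost of an extra pass.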
In all cases, you should benchmark these potential improvements, where they apply, against the standard library implementation. It is impossible to give a definitive answer without measuring.