
This is the scenario:

  1. We have a large data model with > 2,000,000 data points.
  2. Each datum includes: a measurement (int or float), a timestamp, a quality enum (good, bad, unknown) and a unique id.
  3. Measurement updates come in via Kafka and approximately 20% of the measurements are updated each second.
  4. Every few seconds, an algorithm runs against the current measurement model. At this point the in-memory snapshot is locked.
  5. When the algorithm is complete, the measurement model is updated with the changes waiting on the Kafka bus.

What is the best data structure (for performance) to use for the in-memory snapshot of the measurement model?

Thanks

user3029642
  • How are you accessing the elements? By the ID? – eesiraed Jul 10 '18 at 00:08
  • By the ID, yes. – user3029642 Jul 10 '18 at 00:11
  • You would probably use a binary search tree or a hash table, implemented in [`std::map`](https://en.cppreference.com/w/cpp/container/map) and [`std::unordered_map`](https://en.cppreference.com/w/cpp/container/unordered_map). – eesiraed Jul 10 '18 at 00:13
  • 2
    Is the dataset fixed in length or can it grow/shrink? - Do you need or have a flag for int vs float? - What type is the ID, and are the ID's contiguous (or can you still change the type)? - How long is the data locked (and can there be multiple pending updates for the same ID in that time)? - In what form does the data arrive (json, binary)? – Danny_ds Jul 10 '18 at 01:35
  • And what is the resolution of the timestamp? – Danny_ds Jul 10 '18 at 02:20
  • The dataset is fixed. No flag for int vs float. The two measurement types will have separate topics. – user3029642 Jul 10 '18 at 04:25
  • The data will be locked for about 100ms and there could be multiple pending updates for the same id. The ids are not contiguous and they cannot be changed. – user3029642 Jul 10 '18 at 04:26
  • The form of the incoming data is still up in the air. – user3029642 Jul 10 '18 at 04:28
  • The resolution of the timestamp will be to the millisecond. – user3029642 Jul 10 '18 at 04:30
  • If you make the guid a string, you can model the below answer using Avro, which works well with Kafka when you use a Schema Registry – OneCricketeer Jul 10 '18 at 13:52

1 Answer


So each of your datums is (basically) the following struct:

struct datum {
    unsigned char guid[16];
    enum { Int, Float } measurement_kind;
    union {
        int i;
        float f;
    } measurement;
    time_t timestamp;
    enum { Good, Bad, Unknown } quality;
};

That's 40 bytes per datum on a typical 64-bit ABI (padding included). With 2 million of these, the total comes to roughly 80 megabytes. Even if your data structure has 4x overhead, that's not exactly "big" data; some Xeon CPUs can almost fit that in their L3 cache.


At a minimum, you need a data structure with fast lookups by ID, so a hash table (std::unordered_map) is the obvious choice. But there are a few things you might be able to exploit to roll your own hash table implementation that could outperform it.

  1. If your IDs are contiguous (rather than GUIDs, as I'm assuming), then you can use a plain array rather than a hash table, which has the clear advantage of not requiring a hash function: just use the IDs as indices.
  2. If you have a fixed (or bounded) number of data points, you can store the actual data points in contiguous memory. Using an open-addressing hash table (unlike std::unordered_map) with a fixed load factor might also be faster. Test storing both pointers to the elements and the elements themselves in the table.
  3. If you can take ownership of Kafka's results, then copying pointers rather than the full structures might be better. Memory fragmentation might make this slow, but it also might not.
  4. If you know that certain measurements get "hot" (i.e. are updated frequently), then reordering them in the contiguous store and in the hash table chains could improve your cache locality.
  5. If you know that during an update you won't change the hashtable, then you can partition the updates and parallelize them trivially, without locks.

In all cases, you should test these potential improvements, if they apply, against the standard library implementation. It is impossible to give a certain answer without measuring performance.

Alex Reinking