
I have 200 million records, some of which have variable-sized fields (strings, variable-length arrays, etc.). I need to perform some filters, aggregations, etc. on them (analytics-oriented queries).

I want to just lay them all out in memory (enough to fit in a big box) and then do linear scans on them. There are two approaches I can take, and I want to hear your opinions on which approach is better for maximizing speed:

  1. Using an array of structs with char* and int* members to deal with the variable-length fields
  2. Using a large byte array, scanning it like a binary stream, and then parsing the records

Which approach would you recommend?

Update: Using C.

jeffreyveon
  • I recommend that you pick a language first. – crashmstr Sep 05 '14 at 12:21
  • @crashmstr: Just updated my question - using c/c++ – jeffreyveon Sep 05 '14 at 12:23
  • 4
    @jeffreyveon c and C++ are 2 different languages which one? – mmmmmm Sep 05 '14 at 12:24
  • C/C++ is not a language. Modern C and modern C++ have both changed and diverged since they were first created, and idiomatic ("native") solutions can be very different. – crashmstr Sep 05 '14 at 12:31
  • Fair enough. I guess I don't use much of modern C++ anyway, so for all practical purposes, it's C. – jeffreyveon Sep 05 '14 at 12:33
  • You are looking for a struct of arrays, not an array of structs. – Adam Burry Sep 05 '14 at 12:35
  • @AdamBurry I don't read it that way. – ams Sep 05 '14 at 12:39
  • @ams, Jeffrey is concerned about performance, he is examining certain record fields by linear scan, and he has a large amount of data; it sounds like he is looking for struct-of-arrays. See here for an example discussion: http://stackoverflow.com/questions/11616941/structure-of-arrays-and-array-of-structures-performance-difference – Adam Burry Sep 05 '14 at 12:46
  • @AdamBurry He's looking to interpret existing data, laid out linearly in records ... – ams Sep 05 '14 at 12:55
  • @ams, I am assuming his data is in a file and has to be read in before he can process it. – Adam Burry Sep 05 '14 at 13:11

3 Answers


The unfortunate answer is "it depends on the details you haven't provided", which, while true, is not particularly useful. The general advice for approaching a problem like this is to start with the simplest/most obvious design and then profile and optimize as needed. If it really matters, you can benchmark a few designs against your exact data and use-cases first to get a more accurate idea of which direction to take.

Looking at a few specific designs and their general pros/cons:

One Big Buffer

 char *pBuffer = malloc(200000000);   /* one contiguous allocation sized to hold all records back to back */
  • Assumes your data can fit into memory all at once.
  • Would work better for all-text (or mostly-text) data.
  • Wouldn't be my first choice for large data as it just mirrors the data on disk. Better to rely on the hardware/software file cache and read-ahead and read the data directly from the drive, or memory-map it if needed.
  • For linear scans this is a good format, but you lose a bit if it requires complex parsing (especially if you have to do multiple scans); a sketch of such a scan follows this list.
  • Potential for the least overhead, assuming you can pack the structures one after the other.
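
For reference, here is a minimal sketch of such a linear scan. It assumes each record is stored as a 4-byte length prefix followed by its payload; that layout and the process_record callback are inventions for illustration, not something from the question.

 #include <stdint.h>
 #include <string.h>

 /* Hypothetical per-record hook where your filters/aggregations would go. */
 void process_record(const char *payload, uint32_t len);

 /* Walk a packed buffer of length-prefixed records. */
 static void scan_buffer(const char *buf, size_t total)
 {
     size_t off = 0;
     while (off + sizeof(uint32_t) <= total) {
         uint32_t len;
         memcpy(&len, buf + off, sizeof len);   /* memcpy avoids unaligned reads */
         off += sizeof len;
         if (len > total - off)
             break;                             /* truncated record: stop */
         process_record(buf + off, len);
         off += len;
     }
 }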

Static Structure

 typedef struct {
     char  Data1[32];   /* fixed-size string field */
     int   Data2[10];   /* fixed-size array field */
 } myStruct;

 myStruct *pData = malloc(sizeof(myStruct) * 200000000ULL);   /* ~14.4 GB at 72 bytes per record */
  • Simplest design and likely the best potential for speed, at the cost of memory (without actual profiling).
  • If your variable-length arrays have a wide range of sizes you are going to waste a lot of memory, and with 200 million records you may not have enough memory to use this method at all.
  • For a linear scan this is likely the best memory layout thanks to caching/prefetching (see the loop sketch after this list).
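
As an illustration of that last point, the scan itself reduces to a tight loop over pData; summing the first int of every record here is a made-up aggregation:

 #include <stddef.h>

 /* Sum the first int of every record; sequential accesses the prefetcher handles well. */
 long long sum_first_field(const myStruct *pData, size_t count)
 {
     long long sum = 0;
     for (size_t i = 0; i < count; ++i)
         sum += pData[i].Data2[0];
     return sum;
 }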

Dynamic Structure

 typedef struct {
     char* pData1;   /* points to a separately allocated, variable-length string */
     int*  pData2;   /* points to a separately allocated, variable-length array */
 } myStruct2;

 myStruct2 *pData = malloc(sizeof(myStruct2) * 200000000ULL);   /* pointers only; each field still needs its own allocation */
  • With 200 million records this is going to require a lot of dynamic memory allocations, which is very likely to have a significant impact on speed (one mitigation is sketched after this list).
  • Has the potential to be more memory efficient if your dynamic arrays have a wide range of sizes (though see the next point).
  • Be aware of the overhead of the pointers themselves. On a 32-bit system this structure needs 8 bytes (ignoring padding) just to store the two pointers, which is 1.6 GB alone for 200 million records! If your dynamic arrays are generally small (or empty) you may be spending more memory on the overhead than on the actual data.
  • For a linear scan this type of structure will probably perform poorly, as you are accessing memory in a non-linear pattern that the prefetcher cannot predict.
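
If the per-record mallocs turn out to dominate, one common mitigation is a bump allocator: a single large allocation that every variable-length field is carved out of. A sketch, assuming the total payload size can be computed up front:

 #include <stdlib.h>

 /* Minimal bump allocator: one malloc backs all variable-length fields. */
 typedef struct { char *base, *next, *end; } Arena;

 static int arena_init(Arena *a, size_t total)
 {
     a->base = malloc(total);
     if (!a->base) return -1;
     a->next = a->base;
     a->end  = a->base + total;
     return 0;
 }

 static void *arena_alloc(Arena *a, size_t n)
 {
     n = (n + 7) & ~(size_t)7;            /* keep 8-byte alignment for int fields */
     if ((size_t)(a->end - a->next) < n)
         return NULL;                     /* arena exhausted */
     void *p = a->next;
     a->next += n;
     return p;
 }

Freeing everything is then a single free(a->base), and because allocations come out in order, the fields end up roughly contiguous, which also softens the prefetching problem in the last bullet.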

Streaming

  • If you only need to do one scan of the data then I would look at a streaming solution where you read a small piece of the data at a time from the file (sketched after this list).
  • Works well with very large data sets that wouldn't fit into memory.
  • The main limitations here are the disk read speed and how complex your parsing is.
  • Even if you have to do multiple passes, with file caching this might be comparable in speed to the other methods.
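
A sketch of that streaming loop, reusing the same hypothetical length-prefixed record format and process_record hook as in the first sketch:

 #include <stdint.h>
 #include <stdio.h>
 #include <stdlib.h>

 void process_record(const char *payload, uint32_t len);   /* hypothetical hook */

 static int stream_records(const char *path)
 {
     FILE *f = fopen(path, "rb");
     if (!f)
         return -1;

     char    *buf = NULL;
     size_t   cap = 0;
     uint32_t len;

     while (fread(&len, sizeof len, 1, f) == 1) {
         if (len > cap) {                       /* grow the scratch buffer */
             char *tmp = realloc(buf, len);
             if (!tmp)
                 break;
             buf = tmp;
             cap = len;
         }
         if (len != 0 && fread(buf, 1, len, f) != len)
             break;                             /* truncated record: stop */
         process_record(buf, len);
     }

     free(buf);
     fclose(f);
     return 0;
 }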

Which of these is "best" really depends on your specific case; I can think of situations where each one would be the preferred method.

uesp

You could use structs, indeed, but you'd have to be very careful about alignment and aliasing, and it would all need patching up wherever there's a variable-length section. In particular, you could not use an array of such structs, because all entries in an array must be the same size.

I suggest the flat array approach. Then add a healthy dose of abstraction; you don't want your "business logic" doing bit-twiddling.

Better still, if you need to do a single linear scan over the whole data set then you should treat it like a data stream, and de-serialize (copy) the records into proper, native structs, one at a time.
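
A sketch of that de-serialization step. The wire format here (an int32 value, a uint16 name length, then the name bytes) and the field names are invented for illustration:

 #include <stdint.h>
 #include <string.h>

 typedef struct {
     int32_t value;
     char    name[64];   /* native, fixed-size copy of the variable field */
 } Record;

 /* Decode one record from the stream; returns bytes consumed, or 0 on error. */
 static size_t decode_record(const char *p, size_t avail, Record *out)
 {
     uint16_t nlen;
     size_t   hdr = sizeof(int32_t) + sizeof(uint16_t);
     if (avail < hdr)
         return 0;
     memcpy(&out->value, p, sizeof(int32_t));
     memcpy(&nlen, p + sizeof(int32_t), sizeof(uint16_t));
     if (avail < hdr + nlen)
         return 0;
     size_t n = nlen < sizeof(out->name) - 1 ? nlen : sizeof(out->name) - 1;
     memcpy(out->name, p + hdr, n);             /* truncating copy */
     out->name[n] = '\0';
     return hdr + nlen;
 }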

ams

"Which approach would you recommend?" Neither actually. With this amount of data my recommendation would be something like a linked list of your structs. However, if you are 100% sure that you will be able to allocate the required amount of memory (with 1 malloc call) for all your data, then use array of structs.

HighPredator