2

I am reading in statistical data from running a program with different configurations. Let's say there are 6 configurations (a, b, ..., f). The configurations may not change linearly, so if you think of the measurements as a table, there could be gaps in the table. The question is regarding structuring these statistical data in memory.

The first thing that comes to mind is to read these configurations into a dynamically allocated, 6-deep array of arrays:

struct data ******measurements;

Now this works fine. You have very little memory overhead (only the configurations that have data will actually be allocated) and the access time is O(1).

Besides the fact that most people don't like ****** pointers, this has the downside that adding configuration parameters means adding dimensions to the array, which could get ugly unless read/write access to the array is encapsulated in functions. (Writing is already encapsulated to take care of allocation when necessary, so this is in fact not such a big deal.)
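To illustrate, here is a stripped-down sketch of what such encapsulated accessors could look like (the dimension sizes, struct contents and error handling below are placeholders, not my actual code):

    #include <stdlib.h>

    struct data { double mean, stddev; };   /* placeholder for the real statistics */

    /* dims[i]: number of possible values of configuration parameter i (placeholder numbers) */
    static const size_t dims[6] = {200, 100, 50, 40, 30, 20};

    /* Read: O(1); returns NULL if that part of the table was never allocated.
       (An allocated but never-written entry reads back as all zeros; a "valid"
       flag inside struct data would be needed to distinguish it.) */
    static struct data *read_from(struct data ******m, const size_t i[6])
    {
        if (!m[i[0]] || !m[i[0]][i[1]] || !m[i[0]][i[1]][i[2]]
            || !m[i[0]][i[1]][i[2]][i[3]] || !m[i[0]][i[1]][i[2]][i[3]][i[4]])
            return NULL;
        return &m[i[0]][i[1]][i[2]][i[3]][i[4]][i[5]];
    }

    /* Write: allocates intermediate levels lazily, so only hyper-rows that
       actually contain measurements take memory. The top level m is assumed
       to have been calloc'd with dims[0] entries already. */
    static int write_to(struct data ******m, const size_t i[6], const struct data *d)
    {
        struct data *****l1; struct data ****l2; struct data ***l3;
        struct data **l4; struct data *l5;

        if (!(l1 = m[i[0]])  && !(l1 = m[i[0]]  = calloc(dims[1], sizeof *l1))) return -1;
        if (!(l2 = l1[i[1]]) && !(l2 = l1[i[1]] = calloc(dims[2], sizeof *l2))) return -1;
        if (!(l3 = l2[i[2]]) && !(l3 = l2[i[2]] = calloc(dims[3], sizeof *l3))) return -1;
        if (!(l4 = l3[i[3]]) && !(l4 = l3[i[3]] = calloc(dims[4], sizeof *l4))) return -1;
        if (!(l5 = l4[i[4]]) && !(l5 = l4[i[4]] = calloc(dims[5], sizeof *l5))) return -1;
        l5[i[5]] = *d;
        return 0;
    }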

Another idea that comes to mind is to use a map from struct config to struct data, implemented with an AVL tree or similar (which I already have, so there is no implementation overhead). This solves the problem of extending the configuration parameters, but degrades access time to O(log(n)).
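In that case the key would just be the configuration itself, roughly like this (field names are only illustrative; the AVL tree API is the one I already have):

    /* Illustrative key: one field per configuration parameter. Adding a new
       parameter only means adding a field here and extending the comparator. */
    struct config {
        int a, b, c, d, e, f;
    };

    /* Total order over configurations, usable as the tree's comparison callback. */
    static int config_cmp(const struct config *x, const struct config *y)
    {
        if (x->a != y->a) return x->a < y->a ? -1 : 1;
        if (x->b != y->b) return x->b < y->b ? -1 : 1;
        if (x->c != y->c) return x->c < y->c ? -1 : 1;
        if (x->d != y->d) return x->d < y->d ? -1 : 1;
        if (x->e != y->e) return x->e < y->e ? -1 : 1;
        return (x->f > y->f) - (x->f < y->f);
    }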

The number of tests could get large enough for the O(log(n)) to make a difference.

My question is: Is using a 6-deep nested array here justified? Or is there a better way of doing this?

Shahbaz
  • Why not use a flat single-dimensional array and do your own index computations instead of letting the compiler do it? – Adam Rosenfield Jun 11 '12 at 17:22
  • Having a n-dimensional array will increase the possibility of cache misses and can result in very poor performance of the code. Employing a smart data-structure will give you much better performance. If your data set is nominally constant you could build a hash table which has nearly O(1) lookup with a good hash function with the only downside being you have to construct it first... – Mosby Jun 11 '12 at 17:26
  • This is not a 6D array but only an emulation of it. C has n-dimension arrays built-in, no need for the pointer overhead. – Jens Gustedt Jun 11 '12 at 20:08
  • @AdamRosenfield, the table is (kind of) sparse. Allocating memory for all the holes means tens of gigabytes of memory. – Shahbaz Jun 12 '12 at 08:06
  • @JensGustedt, read my comment above. I can't afford to allocate memory for non existing hyper-rows in the table. – Shahbaz Jun 12 '12 at 08:07
  • @Shahbaz, it's just that your vocabulary is wrong. You are not using a 6D array. – Jens Gustedt Jun 12 '12 at 08:09
  • @JensGustedt, I always saw this referred to as "dynamic n-dimensional array", isn't that correct? If so, it is still an "n-dimensional array", the same way "a blue car" is still "a car". – Shahbaz Jun 12 '12 at 08:24
  • @Shahbaz, such a vocabulary is just not precise. And if you want to stay in your example, is a drive emulation a car? Probably not. But anyhow, have a look at my answer to see if it helps :) – Jens Gustedt Jun 12 '12 at 08:35
  • @JensGustedt, Excuse me if I don't quite feel the difference, English is not my mother tongue. However, I believe it would look quite weird if I change all instances of "array" to "emulation of array" in the question text. What do you suggest I do? – Shahbaz Jun 12 '12 at 08:38

3 Answers

2

Note that your current setup is not O(1), it is O(k), where k is the number of dimensions. With a balanced tree it goes to O(log 2^k) == O(k) anyway (I'm assuming each dimension has two choices, but it doesn't matter… it's just a constant here). You might or might not see bigger overhead from the implementation of a balanced tree, though.

What you can do is abstract the interface with typedefs and getter/setter functions (preferably inlineable), and then use any implementation you want behind it. That way you don't have to deal with the pointers as much, and can still use any structure inside.
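For example, an interface along these lines would let you swap the backing store (nested arrays, AVL tree, hash table, ...) and benchmark each one without touching the callers; all names here are only illustrative:

    /* measurements.h -- illustrative interface; the storage strategy is hidden
       behind an opaque handle, so it can be changed freely. */
    struct config;                      /* one field per configuration parameter */
    struct data;                        /* the measured statistics               */
    typedef struct measurements measurements;   /* defined by whichever backend you pick */

    measurements *measurements_create(void);
    void          measurements_destroy(measurements *m);

    /* Returns NULL if nothing was recorded for that configuration. */
    struct data  *measurements_get(measurements *m, const struct config *c);

    /* Returns 0 on success, -1 on allocation failure. */
    int           measurements_set(measurements *m, const struct config *c,
                                   const struct data *d);

If the chosen backend is small enough to live in the header, the getter/setter can instead be declared static inline there, so the abstraction costs nothing at run time.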

liori
  • Yes, it is `O(k)`, `k` being the number of configuration parameters. Each parameter may be any integer (say, less than 65k), although not all possible values are logged. So one parameter may take 200 possible values and another a different number. If parameter `i` has `xi` possible values, you have in total `n=x1*x2*x3*...*x6` measurements. With a balanced tree, this becomes `O(log(n))`, which is completely unrelated to `k`. – Shahbaz Jun 12 '12 at 08:11
  • Using the 6d array is in fact quite simple, you just need `write_to` and `read_from` (inline) functions that take the 6 indices. You are just confirming that I should go with the array then? – Shahbaz Jun 12 '12 at 08:12
  • I'm just saying that if you abstract the interface, you can test which method is actually the fastest in your specific case, without changing other people's code. – liori Jun 12 '12 at 11:08
2

X-trees are a common choice for efficient storage and lookup of high-dimensional data. The Wikipedia article linked is just a stub, but it points to a more authoritative source.

They are basically an improved version of the R-tree, optimized to minimize overlap.

I don't know of a C implementation offhand.

dmckee --- ex-moderator kitten
  • This is very interesting. I also read on R-trees. It seems to me though, that given the tabular nature of my data, this will reduce to a mixture of the arrays and tree method. For example a balanced tree over the first four configurations and inside each group a 2D array. What do you think? – Shahbaz Jun 12 '12 at 08:20
1

The real performance bottleneck (besides wasted memory) is not computation but the number of indirections and cache misses that you will encounter. These can dominate the computation of indices and the like by a factor of a hundred to a thousand. So your best choice is to reduce the number of indirections and invest a bit more in computation. As far as I can see, this would best be done by hashing your 6 indices. That would reduce your indirections to two: first looking up the value (probably a pointer) stored in the hash table, and then fetching the data you are interested in directly.
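A minimal sketch of that idea (the table size, hash mix and collision policy below are just illustrative choices):

    #include <stdint.h>
    #include <string.h>

    struct data;                          /* the statistics from the question     */

    #define TABLE_SIZE (1u << 20)         /* power of two, sized to the data set  */

    struct entry {
        unsigned idx[6];                  /* the six configuration indices (key)  */
        struct data *value;               /* NULL marks an empty slot             */
    };

    static struct entry table[TABLE_SIZE];

    /* FNV-1a-style mix over the six indices: a little computation instead of
       five pointer dereferences. */
    static size_t hash6(const unsigned idx[6])
    {
        uint64_t h = 14695981039346656037ull;
        for (int i = 0; i < 6; ++i) {
            h ^= idx[i];
            h *= 1099511628211ull;
        }
        return (size_t)(h & (TABLE_SIZE - 1));
    }

    /* Open-addressing lookup: one indirection into the table, one more to the
       data itself. Assumes the table never becomes completely full. */
    static struct data *lookup(const unsigned idx[6])
    {
        size_t i = hash6(idx);
        while (table[i].value) {
            if (memcmp(table[i].idx, idx, sizeof table[i].idx) == 0)
                return table[i].value;
            i = (i + 1) & (TABLE_SIZE - 1);
        }
        return NULL;
    }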

Jens Gustedt