
I am performing something similar to an N-dimensional convolution, but will be combining values that are close to one another as I proceed, to save memory and time.

  1. I look for a key in the array.
  2. If I find the key, I add to the value stored at that key.
  3. If I do not find the key, I find the next highest and next lowest key.
  4. If the closer of the two neighbors is close enough, then I accumulate with that key-value pair.
  5. Otherwise I add a new key-value pair.
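Since the BCL has no built-in "next lowest / next highest key" lookup, here is a sketch of steps 1–5 in Java, where TreeMap's floorKey and ceilingKey map directly onto step 3. The class name and the fixed tolerance parameter are my own inventions, not part of the question:

```java
import java.util.TreeMap;

class SparseAccumulator {
    private final TreeMap<Double, Double> map = new TreeMap<>();
    private final double tolerance;   // "close enough" distance (an assumption)

    SparseAccumulator(double tolerance) { this.tolerance = tolerance; }

    // Steps 1-5: exact hit, else nearest neighbor within tolerance, else insert.
    void accumulate(double key, double value) {
        Double exact = map.get(key);                         // step 1: look for the key
        if (exact != null) {
            map.put(key, exact + value);                     // step 2: accumulate
            return;
        }
        Double lower = map.floorKey(key);                    // step 3: next lowest key
        Double higher = map.ceilingKey(key);                 // step 3: next highest key
        Double nearest = null;
        if (lower != null && (higher == null || key - lower <= higher - key)) {
            nearest = lower;
        } else if (higher != null) {
            nearest = higher;
        }
        if (nearest != null && Math.abs(nearest - key) <= tolerance) {
            map.put(nearest, map.get(nearest) + value);      // step 4: merge with neighbor
        } else {
            map.put(key, value);                             // step 5: new key-value pair
        }
    }

    int size() { return map.size(); }
    double get(double key) { return map.getOrDefault(key, 0.0); }
}
```

All four operations (get, floorKey, ceilingKey, put) are O(log n) on the red-black tree backing TreeMap, and the map iterates in key order.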

The key is a double. It is always positive and never infinite. (I handle zeroes specially.) I expect the values to range from pennies to as high as 100 billion. The rounding coarseness will change as the algorithm proceeds to maintain a maximum array size between 10,000 and 1,000,000. (Only testing will reveal the sweet spot in the trade-off between speed, memory and accuracy.) Because of the range of values versus array size, direct addressing is not practical; I need sparse storage.

The naive approach is to use a List and perform a BinarySearch to find the key or insertion point, then proceed from there. This is fast for finding the nearest key and can be iterated in key order, but inserts are horrible. (I do not need to perform deletes! Each iteration in the outer loop creates a new list from scratch.)
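For reference, that naive approach looks like this in Java (Collections.binarySearch returns -(insertionPoint) - 1 on a miss, just as List<T>.BinarySearch returns the bitwise complement of the insertion point in .NET); the class name is made up, and the O(n) element shift on insert is exactly what makes it horrible:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class NaiveSortedList {
    // Parallel sorted lists: keys[i] maps to values[i].
    final List<Double> keys = new ArrayList<>();
    final List<Double> values = new ArrayList<>();

    void accumulate(double key, double value) {
        int i = Collections.binarySearch(keys, key);   // O(log n) search
        if (i >= 0) {
            values.set(i, values.get(i) + value);      // exact hit: accumulate in place
        } else {
            int insertAt = -i - 1;                     // neighbors sit at insertAt-1 and insertAt
            keys.add(insertAt, key);                   // O(n) element shift: the bad part
            values.add(insertAt, value);
        }
    }
}
```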

What data structure is recommended? Wikipedia mentions a few, like Trie, Judy array, etc.

(I implemented something Trie-like with similar characteristics years ago, but that was in Java, took me a week to implement, and was tricky. I am crunched for time.)

UPDATE:

The suggestion of SortedSet causes me to modify my requirements. While finding the next lowest and next highest key was my way of accomplishing my task, SortedSet.GetViewBetween goes about things differently. Since I just want to see if there is a value close enough to be aggregated with, and I have a certain rounding granularity G, I can just ask for all elements of interest using

var possibilities = mySet.GetViewBetween(x - G, x + G);

If that set is empty, I need to add. If it is not, it is probably a small set and I iterate through it.

I need to perform performance testing to see if it is fast enough. But even if it is not, any other collection that honors the same contract is an acceptable alternative to FindNextHighestKey and FindNextLowestKey.
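The same windowed-lookup contract exists in Java as TreeMap.subMap, so the update's logic can be sketched like this; the merge-with-nearest policy inside the window is my assumption about how the small view would be consumed:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

class WindowAccumulator {
    final NavigableMap<Double, Double> map = new TreeMap<>();
    final double g;   // rounding granularity G

    WindowAccumulator(double g) { this.g = g; }

    void accumulate(double x, double value) {
        // All candidates within [x - G, x + G]; this view is usually tiny.
        NavigableMap<Double, Double> window = map.subMap(x - g, true, x + g, true);
        if (window.isEmpty()) {
            map.put(x, value);                 // nothing close enough: add a new entry
        } else {
            // Iterate the small view and merge with the nearest key.
            double best = window.firstKey();
            for (double k : window.keySet()) {
                if (Math.abs(k - x) < Math.abs(best - x)) best = k;
            }
            map.put(best, map.get(best) + value);
        }
    }
}
```

Unlike the GetViewBetween issue discussed in the answer comments below, subMap returns a lazy view whose construction does not count elements.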

UPDATE 2:

I have decided to use plain Dictionary, and force the keys into buckets using a custom rounding function. Iterating the items in sorted order is not crucial, and by using this rounding function, I can find "close enough" values with which to aggregate. I will not change the granularity during an iteration; I will adjust it every time I finish convolving with a new dimension. Each iteration I create a new array to hold the results of that pass.
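That bucketing scheme, sketched here in Java with a plain HashMap: round each key onto a grid of width G, use the rounded key for lookup, and store the unrounded key alongside the running sum. The simple Math.round-based grid function is a stand-in (a comment below notes the real rounding function was the hard part), and all names are invented:

```java
import java.util.HashMap;
import java.util.Map;

class BucketAccumulator {
    // Stored per bucket: the first unrounded key seen plus the running sum.
    static final class Entry {
        final double unroundedKey;
        double sum;
        Entry(double key, double sum) { this.unroundedKey = key; this.sum = sum; }
    }

    final Map<Long, Entry> buckets = new HashMap<>();
    final double g;   // granularity: keys in the same G-wide bucket aggregate

    BucketAccumulator(double g) { this.g = g; }

    // Force the double key into an integer bucket index (stand-in rounding function).
    long bucketOf(double key) { return Math.round(key / g); }

    void accumulate(double key, double value) {
        buckets.merge(bucketOf(key), new Entry(key, value),
                (old, fresh) -> { old.sum += fresh.sum; return old; });
    }
}
```

Lookups and inserts are now O(1) on average, at the cost of sorted iteration, which the update says is not crucial.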

Paul Chernoch
  • I know no built in collection that supports this. BCL support for ordered collections sucks. – CodesInChaos Jan 11 '13 at 15:05
  • Not homework. The application is the N-fold convolution of probabilistic losses from N properties due to a hurricane to get the expected value of the loss. Insurance risk analysis. – Paul Chernoch Jan 11 '13 at 15:18

2 Answers


If your key is unique, you may look at Dictionary<TKey,TValue> or SortedDictionary<TKey,TValue>.

chameleon
  • How would you implement "If I do not find the key, I find the next highest and next lowest key." – CodesInChaos Jan 11 '13 at 15:06
  • I don't believe there is an efficient mechanism for getting 'neighboring keys' from either of these. – Chris Pitman Jan 11 '13 at 15:09
  • There is an efficient way to see if a key is between x-k and x+k: write a custom comparator for the dictionary that enforces the granularity requirements. However, as my algorithm proceeds, I will be widening the granularity as necessary, and once you add items to a Dictionary, you better not change the comparator midstream! – Paul Chernoch Jan 11 '13 at 15:45
  • In the end, plain old Dictionary is what I needed all along. It does not answer my problem as stated, but it gets the job done. Before I add a value, I round its key, but store a structure with the unrounded value. Then to look for a value, I round the key being searched and return the value. The hardest part was writing a rounding function... – Paul Chernoch Jan 12 '13 at 18:44

I found this question, which led me to SortedSet<T>.

If you can handle O(log(n)) for insert, delete, and lookup, this might be where you should keep your keys.


Based on your new requirement... Why not just map the doubles by the granularity to sparse keys before use and go with a Dictionary<double, T>? This won't work if you want the granularity to change during runtime, but neither would the other approach, really.

Amy B
  • Seems a bit annoying to use, but it should be possible to implement a binary search with that. – CodesInChaos Jan 11 '13 at 15:17
  • @CodesInChaos it is a binary search tree. You just call SortedSet.Contains to do the binary search. It doesn't implement IList, so prev and next element is a little tricky. I'm thinking that GetViewBetween may help there (since the window boundaries are known). – Amy B Jan 11 '13 at 15:23
  • Looks close to what I need. I had considered red-black trees, and knew about SortedSet but it has no "FindClosest" method. However, I did not know about the "GetViewBetween" method. If it is efficient, I could accomplish my task using that. I will investigate. – Paul Chernoch Jan 11 '13 at 15:26
  • Raymond Chen commented on this question http://stackoverflow.com/questions/9850975/why-sortedsett-getviewbetween-isnt-olog-n about a "performance problem" with GetViewBetween. It seems that GetViewBetween counts the elements in the view, so you want to be fairly sure you aren't getting a large view (which it sounds like you are already doing). – Amy B Jan 11 '13 at 15:46
  • +1 to you, David B. I was just about to write the performance test. Clearly SortedSet will not perform for me with the problem highlighted in that post. – Paul Chernoch Jan 11 '13 at 16:38
  • Given what you've described, there should be only 2 or 3 elements in the view, which shouldn't be a problem. It's not O(n) where n is elements in the SortedSet. It's O(n) where n is elements in the subset that the view sees. – Amy B Jan 11 '13 at 18:15