2

What kind of data structure should I use if I have a bunch of the following data:

MyData
{
    float ValueA;
    float ValueB;
    object OtherInfo;
}

And I need to quickly find all MyData instances that meet this kind of criteria:

0.5 <= ValueA <= 0.7    AND    0.2 <= ValueB <= 0.9

It's simple to find these items, but I need it to be done as fast as possible. Is there a data structure ideal for this? I don't mind using up extra memory.

My current idea is to have a sorted list, sorted by ValueA. Then I would do a binary search to find the first item that has a ValueA >= 0.5, and iterate until I get an item with ValueA > 0.7. For each iteration, I check if the ValueB is between 0.2 and 0.9. I put all the matching items in my output dataset.

Is there a better/faster way of doing it?

I'm using C#, by the way.

lifeformed
  • 525
  • 4
  • 16
  • 1
    I don't think a Binary search would work well. It would find AN instance where MyData.ValueA >= .5, but then you would have to loop backwards through the collection to find the FIRST instance where ValueA >= .5 – Alex Jul 08 '14 at 17:26
  • 1
    @Alex Once you've found a single instance it's a constant amount of work to find all of the other items matching the condition as you're simply walking up or down, which is an very efficient thing to do. A binary search is *exactly* the right tool for the job here. – Servy Jul 08 '14 at 17:31
  • @Servy search tree in some form is indeed possible answer, but I think there is no way "constant amount of work to find all of the other items" to be true unless I'm reading "constant amount" wrong - i.e. what if all items match the criteria and you need to enumerate them all? – Alexei Levenkov Jul 08 '14 at 18:07
  • @AlexeiLevenkov It's a constant amount *per item that matches*. It's *impossible* for this method to perform faster than O(M), where M is the number of matches. The asymptotic complexity of the method can never be better than the size of the output, because just printing the output takes that much time. – Servy Jul 08 '14 at 18:10
  • @Servy that indeed makes sense... It was not exactly clear to me what you mean in the original comment. – Alexei Levenkov Jul 08 '14 at 18:19

4 Answers4

3

An interval tree would fit your needs. Example implementation (which I have not tested) at Codeplex. More here.

Update: A comparison of interval trees with a related structure, the segment tree, is here: What are the differences between segment trees, interval trees, binary indexed trees and range trees?

Community
  • 1
  • 1
dbc
  • 104,963
  • 20
  • 228
  • 340
2

Your problem is two-dimensional but you're only optimizing the search space in one dimension. A quadtree is an appropriate data structure; you should have no difficulties finding suitable implementations in C#. If you want other alternatives then just search for 2D spatial partitioning algorithms. That's essentially what you're doing here; ValueA and ValueB may be treated as 2D coordinates even if that's not what they actually represent.

RogerN
  • 3,761
  • 11
  • 18
  • I don't think a quadtree would apply here. terence has a single real number and wants to find all intervals containing it. Quadtree would apply if terence had two real numbers and wanted to find other intervals with start and ends similar to those two numbers. – dbc Jul 08 '14 at 18:02
  • dbc, that's not how I read his question. He states that he has multiple data elements containing ValueA and ValueB, and he wants to query them with a single interval. That's exactly what a quadtree is good for; if that's not what he wants then the original question should be clarified. – RogerN Jul 08 '14 at 18:09
2

You can do this with a single list that's sorted first on ValueA and then on ValueB. So if you have these items:

  A    B
{0.5, 0.7}
{0.3, 0.1}
{0.5, 0.2}
{0.5, 0.4}
{0.2, 0.3}

You order them in your list like this:

{0.2, 0.3}
{0.3, 0.1}
{0.5, 0.2}
{0.5, 0.4}
{0.5, 0.7}

Sorting by two criteria in LINQ is really easy:

var sortedList = theList.OrderBy(s => s.ValueA).ThenBy(s => s.ValueB).ToList();

You could also sort a list in-place by passing a custom comparison delegate to the Sort method.

Now, given the search: 0.5 <= ValueA <= 0.7 && 0.2 <= ValueB <= 0.9, you do the following:

  1. Binary search to find the first ValueA that is >= 0.5. Call this startA
  2. Binary search to find the last ValueA that is <= 0.7. Call this endA
  3. Binary search the interval [startA..endA] for the first ValueB that is >= 0.2. Call this startB.
  4. Binary search the interval [startA..endA] for the last ValueB that is <= 0.9. Call this endB.

The items you're looking for are in the interval [startB..endB].

Note that if you're going to iterate over all the items in the [startB..endB] range, you don't have to do that last binary search, as you could filter them out during iteration.

Jim Mischel
  • 131,090
  • 20
  • 188
  • 351
1

In this solution I'm binning the data into lists, where each list contains all data-points within a widthX by widthY box. If your data is in the range valA=[0,1], valB=[0,2] and you're dividing the grid into .1 by .1 bins (as shown in code), then searchList(4,12) is a list containing all values such that .4 <= valA <= .5 and 1.2 <= valB <= 1.3

This solution will only be efficient for rather large input arrays that are fairly tight. If there are any extreme outliers, you will create many unnecessary bins.

It works best if the queries are on 'natural' (I.E. .1, .01) boundaries, but it's not necessary. You would have to manually loop through searchList(i,j) to filter out unwanted elements.

Search queries are fairly easy - find your min/max bin on each dimension, and loop over all bins in between.

static List<MyData>[,] GetSearchArray(List<MyData> srcList)
{
  float minA = srcList.Min(s => s.valA);
  float minB = srcList.Min(s => s.valB);
  float maxA = srcList.Max(s => s.valA);
  float maxB = srcList.Max(s => s.valB);

  // Round the min's down, the max's up. 
  // If your searches are likely to fall on .1 intervals, that's a good rounding distance.
  // Note, they don't have to be the same for valA and valB
  minA = (float)Math.Floor(minA*10f)/10f;
  minB = (float)Math.Floor(minB*10f)/10f;
  maxA = (float)Math.Ceiling(maxA*10f)/10f;
  maxB = (float)Math.Ceiling(maxB*10f)/10f;

  // How many groupings between minA->maxA and minB->maxB
  int divsA = (int)Math.Round((maxA-minA)/.1);
  int divsB = (int)Math.Round((maxB-minB)/.1);

  // 2d array of data lists.
  var searchList = new List<MyData>[divsA,divsB];

  for (int i = 0; i < divsA; i++)
  {
    for (int j = 0; j < divsB; j++)
    {
      // row i: (minA + i*.1) <= valA <= (minA + (i+1)*.1)
      // col j: (minB + j*.1) <= valB <= (minB + (j+1)*.1)
      searchList[i, j] = new List<MyData>();
      var query = srcList.Where(s => (minA + i * .1 <= s.valA) && (s.valA <= minA + (i + 1) * .1) &&
                                     (minB + j * .1 <= s.valB) && (s.valB <= minB + (j + 1) * .1));
      foreach (MyData d in query)
        searchList[i, j].Add(d);

      // You may want to sort searchList here for fast intra-bin queries.
    }
  }

  return searchList;
}
Alex
  • 266
  • 1
  • 6