
I have an array in C# which contains numbers (e.g. int, float or double); I have another array of ranges (each defined as a lower and upper bound). My current implementation is something like this:

        foreach (var v in data)
        {
            foreach (var row in ranges)
            {
                if (v >= row.lower && v <= row.high)
                {
                    statistics[row]++;
                    break;
                }
            }
        }

So the algorithm is O(mn), where m is the number of ranges and n is the number of data values.

Can this be improved? In practice n is big, and I want this to be as fast as possible.

James Brierley
justyy
  • Do the ranges overlap? – Luaan Aug 21 '15 at 15:16
  • Ideally, yes they may overlap. but if we can have something very fast based on the fact that they can't overlap, then I'd be happy to accept that. – justyy Aug 21 '15 at 15:19
  • 1
    If you have overlapping ranges, you can tweak the algorithms given for non-overlapping ranges to still work. Take the list of ranges you start out with and derive a list of non-overlapping "subranges", where a subrange has a lower bound, upper bound, and list of the original ranges that overlap this subrange. Supply this list of subranges to one of the algorithms below to get a histogram over subranges, then for every subrange S for every original range R listed by S, add the total for S to R; this gives us the histogram over the original ranges. – Jerry Federspiel Aug 21 '15 at 19:05
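The subrange construction described in the last comment could be sketched like this (a rough illustration only; the `Subranges.Build` helper and the tuple shapes are my own invention, not code from the question):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Subranges
{
    // Split possibly-overlapping integer ranges [lower, high] into disjoint
    // subranges, each tagged with the indices of the original ranges covering it.
    public static List<(int lower, int high, List<int> owners)>
        Build((int lower, int high)[] ranges)
    {
        // Candidate cut points: every lower bound, and every high bound + 1.
        var cuts = ranges.SelectMany(r => new[] { r.lower, r.high + 1 })
                         .Distinct().OrderBy(x => x).ToArray();

        var result = new List<(int, int, List<int>)>();
        for (int i = 0; i + 1 < cuts.Length; i++)
        {
            int lo = cuts[i], hi = cuts[i + 1] - 1;
            // An original range "owns" this subrange if it fully covers it.
            var owners = Enumerable.Range(0, ranges.Length)
                .Where(j => ranges[j].lower <= lo && hi <= ranges[j].high)
                .ToList();
            if (owners.Count > 0)
                result.Add((lo, hi, owners));
        }
        return result;
    }
}
```

Feeding these disjoint subranges to one of the algorithms in the answers, then adding each subrange's total back into every owner range, recovers the histogram over the original overlapping ranges.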

3 Answers


Sort the data array; then for each interval, find the first index in data that falls inside the range and the last one (both using binary search). The number of elements in the interval is then easily computed as lastIdx - firstIdx (or +1, depending on whether lastIdx is inclusive or not).

This is done in O(m log m + n log m), where m is the number of data values and n the number of intervals.
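A minimal sketch of this approach, assuming int data (the helper names `LowerBound`/`UpperBound` and `CountInRanges` are mine; C#'s built-in `Array.BinarySearch` does not directly give the first/last index of a value):

```csharp
using System;

class RangeHistogram
{
    // First index i in sorted a with a[i] >= x (classic lower bound).
    static int LowerBound(int[] a, int x)
    {
        int lo = 0, hi = a.Length;
        while (lo < hi)
        {
            int mid = (lo + hi) / 2;
            if (a[mid] < x) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    // First index i in sorted a with a[i] > x (upper bound).
    static int UpperBound(int[] a, int x)
    {
        int lo = 0, hi = a.Length;
        while (lo < hi)
        {
            int mid = (lo + hi) / 2;
            if (a[mid] <= x) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    // Count of data elements falling in [lower, high] for each range.
    public static int[] CountInRanges(int[] data, (int lower, int high)[] ranges)
    {
        var sorted = (int[])data.Clone();
        Array.Sort(sorted);                      // O(m log m)
        var counts = new int[ranges.Length];
        for (int i = 0; i < ranges.Length; i++)  // O(n log m)
            counts[i] = UpperBound(sorted, ranges[i].high)
                      - LowerBound(sorted, ranges[i].lower);
        return counts;
    }

    static void Main()
    {
        var data = new[] { 5, 1, 9, 3, 7, 3 };
        var ranges = new[] { (lower: 1, high: 3), (lower: 4, high: 9) };
        Console.WriteLine(string.Join(",", CountInRanges(data, ranges)));
        // prints 3,3
    }
}
```

Note that unlike the original nested loops, this counts a value in every range it falls into, which for overlapping ranges is exactly what the subrange trick in the comments needs.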

Bonus: If the data is changing constantly, you can use an order statistics tree with the same approach (this tree lets you easily find the index of each element, and it supports modifying the data).

Bonus 2: Optimality proof

Using comparison-based algorithms, this cannot be done better, since if we could, we could also solve the element distinctness problem faster than its known lower bound.

Element Distinctness Problem:

Given an array a1, a2, ..., an - find out whether there are indices i, j with i != j such that ai = aj.

This problem is known to have an Omega(n log n) lower bound for comparison-based algorithms.

Reduction:

Given an instance a1, ..., an of the element distinctness problem, create data = a1, ..., an and the intervals [a1, a1], [a2, a2], ..., [an, an], and run the algorithm.
If there are more than n matches in total, there are duplicates; otherwise there are none.

The complexity of the above reduction is O(n + f(n)), where n is the number of elements and f(n) is the complexity of this algorithm. The whole thing has to be Omega(n log n), so f(n) must be as well, and we can conclude there is no asymptotically faster comparison-based algorithm.

amit

Assuming the ranges are ordered, you always take the first range that fits, right?

This means that you could easily build a binary tree of the lower bounds. You find the highest lower bound that's lower than your number, and check whether the number also fits the upper bound. If the tree is properly balanced, this gets you quite close to O(n log m). Of course, if you don't need to change the ranges frequently, a simple ordered array will do - just use the usual binary search methods.
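A sketch of the ordered-array variant of this idea, assuming the ranges are already sorted by lower bound and do not overlap (the `Histogram` helper and tuple shapes are mine):

```csharp
using System;

class LowerBoundSearch
{
    // For each value, binary-search the highest lower bound <= v,
    // then verify that v also fits that range's upper bound.
    // sortedRanges must be sorted by lower bound and non-overlapping.
    public static int[] Histogram(int[] data, (int lower, int high)[] sortedRanges)
    {
        var lowers = new int[sortedRanges.Length];
        for (int i = 0; i < sortedRanges.Length; i++)
            lowers[i] = sortedRanges[i].lower;

        var counts = new int[sortedRanges.Length];
        foreach (var v in data)
        {
            int idx = Array.BinarySearch(lowers, v);
            // A negative result is the bitwise complement of the insertion
            // point; insertionPoint - 1 is the last lower bound <= v.
            if (idx < 0) idx = ~idx - 1;
            if (idx >= 0 && v <= sortedRanges[idx].high)
                counts[idx]++;
        }
        return counts;
    }
}
```

Each value costs one O(log m) search, for O(n log m) total, and the data array never needs to be sorted.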

Using a hashtable instead could get you pretty close to O(n), depending on how the ranges are structured. If data is also ordered, you could get even better results.

Luaan
  • Seems like, if you're iterating over `data`, you'd get O(n log m). Or, with a hashtable, O(n). Or am I reading it wrong? (The reason I ask is that OP has indicated that `n` is big, so it matters whether it's *log m* or *log n*.) – Erick G. Hagstrom Aug 21 '15 at 16:19
  • @ErickG.Hagstrom Yeah, right - I assumed m being the numbers, not the ranges. – Luaan Aug 21 '15 at 18:53

An alternate solution that doesn't involve sorting the data:

var dictionary = new Dictionary<int, int>();

foreach (var v in data) {
    if (dictionary.ContainsKey(v)){
        dictionary[v]++;
    } else {
        dictionary[v] = 1;
    }
}

foreach (var row in ranges) {
    for (var i = row.lower; i <= row.high; i++) {
        // Guard against values that never occur in data; indexing
        // dictionary[i] directly would throw KeyNotFoundException.
        if (dictionary.TryGetValue(i, out var count)) {
            statistics[row] += count;
        }
    }
}

Get a count of the number of occurrences of each value in data, then sum those counts between the bounds of each range. Note that this only works when the values are integers, and it is only fast when the ranges are narrow.

James Brierley