
I am just wondering what the best approach for this calculation is. Let's assume I have an input array of values and an array of boundaries - I want to calculate/bucketize the frequency distribution for each segment defined by the boundaries array.

Is it a good idea to use bucket search for that?

Actually, I found this question: Calculating frequency distribution of a collection with .Net/C#

But I do not understand how to use buckets for that purpose because the size of each bucket can be different in my situation.

EDIT: After all the discussion I have an inner/outer loop solution, but I still want to eliminate the inner loop with a Dictionary to get O(n) performance. If I understood correctly, I need to hash input values into a bucket index, so we need some sort of hash function with O(1) complexity. Any ideas how to do it?
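One possible sketch of that idea (not from the original post): if the values are non-negative integers bounded above by the last boundary, you can precompute a value-to-bucket lookup table once, after which each input value maps to its bucket index in O(1). The boundary values below are the example from the comments; the input values are hypothetical.

// Sketch only: assumes values are non-negative integers smaller than the last boundary.
int[] boundaries = { 0, 10, 50, 100, 120 };   // example boundaries from the comments
int[] input = { 3, 15, 75, 110, 7 };          // hypothetical input values

// Precompute value -> bucket index once: O(range) setup, O(1) per lookup.
int[] valueToBucket = new int[boundaries[boundaries.Length - 1]];
for (int v = 0, b = 0; v < valueToBucket.Length; v++)
{
    if (v >= boundaries[b + 1]) b++;          // advance to the bucket that contains v
    valueToBucket[v] = b;
}

int[] freq = new int[boundaries.Length - 1];
foreach (int d in input)
    freq[valueToBucket[d]]++;                 // O(1) per value, O(n) overall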

Andrey
  • Can you describe the boundaries array a little better? Is there any relationship between the various boundaries (i.e. are they sequential) or are they completely random in size and "location"? I assume the boundaries array completely covers the range of possible values - is that true? Also, I'm assuming there are no overlaps - right? – Mike Dinescu Aug 31 '11 at 15:21
  • Fastest in the meaning of the big "O" or in the meaning of little code? A simple approach would be to write yourself a function Func and use this with LINQ's .GroupBy to group the values into "buckets" - but there might be computationally faster ways to do this. – Random Dev Aug 31 '11 at 15:24
  • Yes, you are right. The boundary values are monotonically increasing in value. There are no overlaps and they cover the range of possible values. So for example: 0, 10, 50, 100, 120. – Andrey Aug 31 '11 at 15:28
  • Fastest in the meaning of the big "O", no LINQ's .GroupBy, just a computational way. – Andrey Aug 31 '11 at 15:30
  • An easy but not too fast solution is a binary search. – CodesInChaos Aug 31 '11 at 15:46
  • What type are the values? If they are .NET primitives (int, double, string, decimal, datetime, etc.) then they already have good O(1) hash functions, and you don't need to worry about them at all. Just use a `Dictionary` and be done with it. However I'd like to note that binary search is VERY fast, and will rival the hash function solution. Better to test which one is faster in your case. – Vilx- Sep 01 '11 at 07:13

2 Answers


Bucket Sort is already O(n^2) worst case, so I would just do a simple inner/outer loop here. Since your bucket array is necessarily shorter than your input array, keep it on the inner loop. Since you're using custom bucket sizes, there are really no mathematical tricks that can eliminate that inner loop.

int[] freq = new int[buckets.Length - 1];       // one counter per bucket [buckets[i], buckets[i+1])
foreach (int d in input)
{
    for (int i = 0; i < buckets.Length - 1; i++)
    {
        if (d >= buckets[i] && d < buckets[i + 1])
        {
            freq[i]++;
            break;                              // value counted, move on to the next input value
        }
    }
}

It's also O(n^2) worst case but you can't beat the code simplicity. I wouldn't worry about optimization until it becomes a real issue. If you have a larger bucket array, you could use a binary search of some sort. But, since frequency distributions are typically < 100 elements, I doubt you'd see a lot of real-world performance benefit.
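If the buckets array ever does get long, here is a hedged sketch of that binary-search variant, using the same `buckets`/`input` names as above. `Array.BinarySearch` returns the index of an exact match, or the bitwise complement of the index of the first larger element, which gives the bucket index directly and yields O(n log m) overall (m = number of boundaries).

int[] freq = new int[buckets.Length - 1];
foreach (int d in input)
{
    int pos = Array.BinarySearch(buckets, d);
    int bucket = pos >= 0 ? pos : ~pos - 1;   // index of the boundary at or below d
    if (bucket >= 0 && bucket < freq.Length)  // ignore values outside the boundary range
        freq[bucket]++;
}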

drharris
  • What do you think about a BucketizedHashtable implementation like the one in Java? Or what about sorting the array at the beginning of execution - does it make sense? – Jevgenij Nekrasov Aug 31 '11 at 17:30
  • Eliminate the inner loop with a `Dictionary` to get amortized O(n) perf. – Hans Passant Aug 31 '11 at 20:56
  • @Hans What do you mean? I do not really understand :( – Andrey Aug 31 '11 at 21:01
  • @Jevgenij - Bucketized hashtables typically work on standard bucket sizes, as I understand it. This works well because instead of looping through the buckets array, you use a function that takes the value and outputs the bucket number. If the conversion function runs in O(1), then you can have O(n) performance since it eliminates the requirement for any inner loop at all. This is similar to what @Hans was saying by using a `Dictionary` - but it requires some way to hash the input value into a bucket index. As for sorting the array, you're only going to increase algorithm complexity. – drharris Sep 01 '11 at 03:04
  • Inner loop can be replaced by binary search, obtaining an overall O(n*log(m)), where n - input count; m - bucket count. – Vilx- Sep 01 '11 at 07:05
  • I mentioned the binary search option, but unless your buckets array is super long, you won't see any real-world performance benefit, and you might even see worse performance. Binary search is good, but it has overhead, and doing it thousands of times in a loop can actually be worse than simply iterating an inner loop, especially for smaller arrays. But, it is an option if the OP wants to go that route. – drharris Sep 01 '11 at 15:50

If your input array represents real-world data (with its patterns) and the array of boundaries is large enough that iterating it again and again in an inner loop is expensive, you can consider the following approach:

  • First of all, sort your input array. If you work with real-world data I would recommend considering Timsort (see the Wikipedia article). It provides very good performance guarantees for the patterns that can be seen in real-world data.

  • Traverse the sorted array and compare each value with the first value in the array of boundaries:

    • If the value in the input array is less than the boundary, increment the frequency counter for this boundary.
    • If the value in the input array is bigger than the boundary, go to the next value in the array of boundaries and increment the counter for the new boundary.

In code it can look like this:

Timsort(myArray);                              // assume Timsort(...) sorts the input in place
int boundPos = 0;                              // current position in the boundaries array
int[] boundaries = GetBoundaries();            // sorted boundary values
int[] freq = new int[boundaries.Length];       // frequency counter per boundary

for (int i = 0; i < myArray.Length; i++) {
    if (myArray[i] < boundaries[boundPos]) {
        freq[boundPos]++;                      // value falls under the current boundary
    }
    else {
        boundPos++;                            // move to the next boundary
        freq[boundPos]++;
    }
}
Andrey Taptunov
  • The boundaries are represented with an array of values. But what about complexity? As I understood, for Timsort it's O(n log n) in the worst case plus O(n) for the looping. I think an inner/outer loop with binary search should be better? – Andrey Sep 01 '11 at 07:20
  • Not quite right. This will fail if there is an "empty" bucket in the middle. That is, there are two input values in the sorted array which are next to each other, but go into buckets that are not next to each other. But that can be fixed. All in all, this is a very good idea. Depending on the data, it might even be possible to use Radix Sort, which is O(n), though it might require a lot of data to make it worthwhile. But the overall runtime would be a clean O(n). – Vilx- Sep 01 '11 at 08:03
  • P.S. Sorry for posting this text as an answer. It was meant to be a comment. – Vilx- Sep 01 '11 at 08:03
  • @Vilx-, Agree and thanks for correction. I didn't consider this case. – Andrey Taptunov Sep 01 '11 at 08:14
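For completeness, here is a minimal sketch of the fix Vilx- describes, under the same assumptions as the answer's snippet (`Timsort` and `GetBoundaries` are placeholders, `boundaries` is a sorted array of upper bounds): a `while` loop skips over any empty buckets instead of advancing by a single position.

Timsort(myArray);                              // any O(n log n) sort works here
int[] boundaries = GetBoundaries();            // sorted boundary values
int[] freq = new int[boundaries.Length];
int boundPos = 0;

for (int i = 0; i < myArray.Length; i++) {
    // Advance past empty buckets until the current value falls under boundPos.
    while (boundPos < boundaries.Length - 1 && myArray[i] >= boundaries[boundPos])
        boundPos++;
    freq[boundPos]++;                          // values beyond the last boundary land in the last bucket
}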