
I'm looking for an efficient algorithm to do the following: given an array of N items, rearrange it so that the items form M equal-sized groups, where each group is unsorted but the groups are ordered relative to each other (every element of one group is less than any element of the next group).
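For example, with N = 9 and M = 3, the array [5, 1, 8, 2, 9, 4, 7, 3, 6] could become [2, 1, 3, 4, 6, 5, 9, 7, 8]: the first three elements are {1, 2, 3} in some order, the next three are {4, 5, 6}, and the last three are {7, 8, 9}.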

The easiest way is to just sort the whole array, but that's inefficient, especially when the number of groups is much smaller than the total number of items (e.g. sorting a million items into 5 groups).

Currently I've settled on the quickselect algorithm (specifically, its Floyd-Rivest variation), which partitions an array into 2 unsorted groups, and applying it M-1 times. A significant improvement would be to apply a divide-and-conquer approach to quickselect: first split the array into two groups, then split each half, and so on, until there are M groups. It's essentially a generalization of the unordered partial sorting problem.
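To make the idea concrete, here's a rough sketch of that divide-and-conquer scheme in JavaScript. A plain Lomuto-partition quickselect stands in for Floyd-Rivest, and the names (select, multiSelect, groupSize) are only illustrative; this is not the actual RBush code:

    function swap(arr, i, j) {
        var t = arr[i];
        arr[i] = arr[j];
        arr[j] = t;
    }

    // Plain quickselect (Lomuto partition): moves the k-th smallest element of
    // arr[left..right] to index k, with smaller elements before it and larger
    // ones after it. Floyd-Rivest would be a drop-in replacement.
    function select(arr, k, left, right) {
        while (left < right) {
            swap(arr, (left + right) >> 1, right); // middle element as pivot
            var pivot = arr[right];
            var store = left;
            for (var i = left; i < right; i++) {
                if (arr[i] < pivot) swap(arr, i, store++);
            }
            swap(arr, store, right); // pivot is now at its sorted position
            if (store === k) return;
            if (store < k) left = store + 1;
            else right = store - 1;
        }
    }

    // Recursively split arr[left..right] (inclusive) into consecutive groups
    // of groupSize: partition at the group boundary closest to the middle of
    // the range, then recurse into both halves until a range fits in one group.
    function multiSelect(arr, left, right, groupSize) {
        if (right - left + 1 <= groupSize) return; // one group: leave unsorted
        var numGroups = (right - left + 1) / groupSize;
        var mid = left + Math.ceil(numGroups / 2) * groupSize;
        select(arr, mid, left, right);
        multiSelect(arr, left, mid - 1, groupSize);
        multiSelect(arr, mid, right, groupSize);
    }

    // Example: a million items into 5 groups of 200000:
    // multiSelect(items, 0, items.length - 1, 200000);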

But I have a gut feeling that there might be a simpler and more efficient approach to the problem. Is there anything I'm missing? Any ideas? I need this for improving bulk insertion performance in my RBush JavaScript library (an efficient R*-tree implementation).

Mourner

1 Answer


Here is a restatement of the problem. You need to simultaneously find M-1 ranked elements, and use them to divide the array into M unsorted groups.

I would suggest starting with standard quickselect with random pivots chosen to be close to the median (use the random-subsample trick to estimate it), at each step dividing the array into 2 and also dividing the list of ranked elements you need to find into 2. Continue this until you get down to the level where there is a single ranked element to find in your current group. Then switch to the Floyd-Rivest variation of quickselect to actually find that element and split the current group into 2. Then, coming back out of the quickselect, you can easily piece together the M groups that you want.

This approach has an expected running time of O(N log(M)) with pretty good constants. I doubt you can do significantly better than that.
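Here's an illustrative sketch of the recursive structure, reusing the select helper from the sketch in the question. It's simplified: it partitions directly at the rank closest to the middle rather than around an estimated-median pivot, and it uses plain quickselect instead of Floyd-Rivest, but the key idea of splitting the list of ranked elements alongside the array is the same:

    // ranks: sorted indices of the group boundaries that fall inside
    // arr[left..right]. Partition at the rank closest to the middle, then
    // recurse on each side with the matching half of the rank list; a range
    // with no remaining ranks is one of the final groups and stays unsorted.
    function multiSelectByRanks(arr, ranks, left, right) {
        if (ranks.length === 0) return;
        var m = ranks[ranks.length >> 1];
        select(arr, m, left, right); // Floyd-Rivest would go here
        multiSelectByRanks(arr, ranks.filter(function (r) { return r < m; }), left, m - 1);
        multiSelectByRanks(arr, ranks.filter(function (r) { return r > m; }), m, right);
    }

    // Example: 12 items into 4 groups of 3 (boundaries at indices 3, 6, 9):
    // multiSelectByRanks(arr, [3, 6, 9], 0, arr.length - 1);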

btilly
  • That sounds brilliant! – Mourner Aug 31 '14 at 18:59
  • So if I understood you right, it can be done by running one quicksort-like pass (where you recurse into both sides instead of one side) to find all the ranked elements, right? Then you would sort the ranked elements and run a partition subroutine on each ranked element's index, resulting in the array of unsorted groups I need? – Mourner Aug 31 '14 at 19:04
  • @Mourner Even simpler than that. You find the ranked elements at the bottom of the quickselect, but don't need them. You just need to know what the groups are on the left and right. But in the process of finding the ranked elements, you wind up having generated lots of partitions of the array for elements bounded from one point to another. On the way back out, you know which unranked groups each partition should go into. So you recurse down to locate the ranked elements, and on your way out of the recursive solution you piece together the groups that you are looking for. – btilly Aug 31 '14 at 19:40
  • Hmm... Given that I need to reorder the original array in the way I want, it looks like after I've found all the ranked elements, the array will already be sorted into groups properly without any additional work on the way out, right? – Mourner Aug 31 '14 at 19:53
  • OK, tried it out here: https://github.com/mourner/rbush/commit/91097fb9f57f3824fe7519130bfa116abc935ca9 – Mourner Aug 31 '14 at 23:00
  • OK, the results are not good. In my tests, it requires about 2x more swaps compared to my previous approach (divide & conquer Floyd-Rivest), and performs up to 50% slower. Unless I made a mistake somewhere, I'll probably have to abandon the suggested approach. In the code, the pivot is the middle element; a random pivot is about the same, and median-of-three performs even worse. – Mourner Aug 31 '14 at 23:03
  • Scratch the 2x more swaps comment; there are actually slightly fewer swaps and comparisons, but for a small M (say 4), the performance is 20% slower. For M = 32, it's about 30% faster. Still, I didn't expect it to be slower for any M... – Mourner Aug 31 '14 at 23:29
  • @Mourner The performance for small M is likely due to poor choices of initial pivots. That is fixable. Looking at your code, I suspect that you might not have handled identical elements correctly. You should test that. – btilly Sep 01 '14 at 00:10