6

Background:

I need to process some hundred thousand events (producing results) given a hard time limit. The clock is literally ticking, and when the timer fires, whatever is done at that point must be flushed out.

What isn't ready by that time is either discarded (depending on an importance metric) or processed during the next time quantum (with an "importance boost", i.e. adding a constant to the importance metric).
Now ideally, the CPU is much faster than needed, and the whole set is done long before the end of the time slice. Unluckily, the world is rarely ideal, and "hundreds of thousands" becomes "tens of millions" before you know it.

Events are added to the back of a queue (which is really a vector) as they come in, and are processed from the front during the respective next quantum (so the program always processes the last quantum's input).

However, not all events are equally important. If the available time is not sufficient, it would be preferable to drop unimportant events rather than important ones (this is not a strict requirement, since important events will be copied to the next time quantum's queue, but doing so further adds to the load, so it isn't a perfect solution).

The obvious thing to use would be, of course, a priority queue / heap. Unluckily, heapifying 100k elements isn't exactly a free operation either (nor does it parallelize), and then I end up with objects in non-obvious and not necessarily cache-friendly memory locations, and pulling elements from a priority queue doesn't parallelize nicely.
What I would really like is something like a vector that is sorted, or at least "somewhat approximately sorted", which one can then traverse sequentially. This would trivially allow me to create e.g. 12 threads (or any other number, one per CPU) that each process e.g. 1/64 of the range (or another size), slowly advancing from the front to the end, and eventually dropping/postponing what's left over -- which will be events of little importance that can be discarded.

Simply sorting the complete range using std::sort would be the easiest, most straightforward solution. However, the time it takes to sort items reduces the time available to actually process elements within the fixed time budget, and sorting time is for the most part single-CPU time (and parallel sort isn't that great either).
Also, doing a perfect sort (which isn't really needed) may run into worst-case complexity, whereas an approximate sort should ideally perform at its optimum and have a very predictable cost.

tl;dr

So, what I'm looking for is a way to sort an array/vector only approximately, but fast, and with a predictable (or guaranteed) runtime.

The sort key would be a small integer typically between 10 and 1000. Being postponed to the next time quantum might increase ("priority boost") that value by a small amount, e.g. 100 or 200.

In a different question, where humans were supposed to do an approximate sort using "subjective compare"(?), shell sort was proposed. Judging from various sorting demo applets, at least for the "random shuffle" input typical in these, shell sort can indeed produce an "approximate sort" that doesn't look too bad after 3-4 passes over the data (and at least the read tap is strictly sequential). Unluckily, choosing gap values that work well seems to be somewhat of a black art, and runtime estimates seem to involve a lot of looking into the crystal ball as well.

Comb sort with a relatively large shrink factor (such as 2 or 3?) seems tempting as well, since it visits memory strictly sequentially (on both taps) and is able to quickly move far-out elements a great distance. Again, judging from sorting demo applets, 3-4 passes already seem to give a rather reasonable "approximate sort".
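For illustration, a pass-limited comb sort might look somewhat like this (Event/importance are placeholders, and the shrink factor and pass count would need tuning; the appeal is that the cost is a fixed max_passes sequential sweeps, hence predictable):

#include <cstddef>
#include <utility>
#include <vector>

struct Event { int importance; /* ... payload ... */ };

// Run at most max_passes comb-sort passes with a fixed shrink factor.
// The result is only approximately sorted, most important events first.
void approx_comb_sort(std::vector<Event>& v, std::size_t shrink = 3, int max_passes = 4)
{
    if (v.size() < 2) return;
    std::size_t gap = v.size() / shrink;
    for (int pass = 0; pass < max_passes; ++pass) {
        if (gap == 0) gap = 1;
        for (std::size_t i = 0; i + gap < v.size(); ++i)
            if (v[i].importance < v[i + gap].importance)
                std::swap(v[i], v[i + gap]);   // important events bubble forward
        gap /= shrink;
    }
}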

MSD radix sort comes to mind, though I am not sure how it would perform given typical 16/32-bit integers in which most of the high-order bits are zero. One would probably have to do an initial pass to find the most significant bit set anywhere in the set, followed by 2-3 actual sort passes?
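That initial pass would just be a sequential OR-reduction over the keys, something like (assuming 32-bit keys):

#include <cstdint>
#include <vector>

// One sequential pass: OR all keys together, then locate the highest set bit.
// With keys in roughly [10, 1000+boost], this tells you which radix digits
// are actually worth sorting on.
int highest_used_bit(const std::vector<std::uint32_t>& keys)
{
    std::uint32_t all = 0;
    for (auto k : keys) all |= k;
    int msb = -1;                 // -1 if all keys are zero
    while (all) { ++msb; all >>= 1; }
    return msb;
}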

Is there a better algorithm or a well-known working approach with one of the algorithms I mentioned?

Damon
  • What comes to mind is to iterate over the vector and if some event is less important, don't process it but put it aside. As soon as the entire vector has been read, have a look at the events put aside. Of course you can use several buckets with different priorities. And only store references there, you don't want to move megabytes of data. – Ronald Jan 29 '14 at 15:12
  • I'd still go with a priority queue. Pulling the objects off the queue should be locked so that it can't be parallel, but as long as the processing time > pull time you should be fine. Creating the queue from the list will put it into (essentially) a vector, and if you are processing the objects on different threads as you say, then the cache-friendliness doesn't matter too much. And depending on how approximate is approximate, you *could* get hold of the underlying container or use std::make_heap and read the heap row by row (though this is far from accurate) – IdeaHat Jan 29 '14 at 15:33
  • @Ronald: That is an embarrassingly simple and clever idea! One could even directly drop events into one of, say, 3-4 "buckets" (i.e. separate vectors/queues) as they are received, without explicitly iterating over them later. That is practically a runtime of _zero_ for sorting (except for 2-3 extra instructions when receiving an event), but it fulfills the requirements, is very cache friendly, trivial to parallelize, and dead simple to implement. Care to post your idea as an answer? – Damon Jan 29 '14 at 15:40
  • @MadScienceDreams: Cache-friendly access patterns matter _a lot_ whenever processing more than a few dozen "things", also with concurrency. Having one worker iterate over, say, 100kiB of contiguous memory and the next one iterate over the next 100kiB is dozens of times faster than having them both access random locations (concurrent writes are even worse). Locking is prohibitive due to the sheer overhead (but I guess it wouldn't be necessary either since there's nothing to parallelize when pulling from a prio queue anyway). While I agree that a prio queue is _formally_ the correct thing... – Damon Jan 29 '14 at 15:56
  • ... as written in the Q too, the shoe doesn't quite fit for the problem. On the other hand, Ronald's solution is so good (it's one of the "bang head on table, why didn't I think of that myself?!" ideas) that it fits perfectly. It is practically zero overhead and addresses the problem without even thinking about how to sort things in the most efficient manner. – Damon Jan 29 '14 at 15:57
  • @Damon (not arguing, if that works then use it). You seem to already know this, but you'll have to create the queues separately then, one for each processor. If processor 1 is reading the first bit of a memory block and processor 2 is reading the second bit, you'll still get a cache miss most of the time. So if your list is being sorted such that it can be read down the line, what you really want is each processor to get its own contiguous queue :-/ – IdeaHat Jan 29 '14 at 16:19
  • @MadScienceDreams Yes that would certainly be more cache friendly... similar to my answer – David Jan 29 '14 at 16:34
  • A ring buffer is normally used for this kind of task. – Ben Jan 29 '14 at 16:51
  • What is your exact constraint: a) maximize (possibly priority-weighted) events per time quantum; b) guarantee minimum number of events per time quantum? I.e., if you are a financial trader, option a) would be likely, but in flight/medical devices b) sounds more likely. In any case, the optimal algorithm would probably be quite different. – TemplateRex Jan 29 '14 at 20:03
  • Another question is how your events are being generated: is each batch generated in say quantum `[t]` and to be processed in quantum `[t+1]`, or are they coming in continuously (i.e. events arriving in `[t]` can be processed in `[t]`)? – TemplateRex Jan 29 '14 at 20:23
  • @TemplateRex: Constraints are to a) not lose any "important" events (_may_ lose non-important ones) and b) process as many events as possible during the same quantum, preferably all. Events are generated in response to UDP datagrams on 10G ethernet, so ca. 10-12 million/second theoretical maximum due to B/W (expected 100k, but you never know). Processing an event is not a trivial operation (involves several random reads (tree traversal), some logic, some writes to private memory, and an atomic operation). – Damon Jan 29 '14 at 21:27

6 Answers

3

What comes to mind is to iterate over the vector and if some event is less important, don't process it but put it aside. As soon as the entire vector has been read, have a look at the events put aside. Of course you can use several buckets with different priorities. And only store references there, you don't want to move megabytes of data. (posted as an answer now as requested by Damon)
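A minimal sketch of the idea (Event, importance, process, and the threshold are placeholders for whatever the real types look like; the check against the time budget is omitted):

#include <cstddef>
#include <vector>

struct Event { int importance; /* ... */ };

void process(Event&);                     // the actual per-event work, defined elsewhere

void run_quantum(std::vector<Event>& queue, int threshold)
{
    std::vector<std::size_t> put_aside;   // indices only -- don't move the events
    for (std::size_t i = 0; i < queue.size(); ++i) {
        if (queue[i].importance >= threshold)
            process(queue[i]);
        else
            put_aside.push_back(i);       // have a look at these afterwards
    }
    for (std::size_t i : put_aside)       // second pass, if time remains
        process(queue[i]);
}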

Ronald
    this is the `std::partition` approach, which has the drawback that you don't know in advance how many elements satisfy the threshold importance. This means that you are not guaranteed to process the most valuable events (even though all processed events are above the threshold). In contrast, `std::nth_element` will max out the available bandwidth and select the most valuable ones (both algorithms are `O(N)`). – TemplateRex Jan 29 '14 at 21:16
3

Use a separate vector for each priority. Then you don't need to sort them.

Ben
  • Or a vector for each bucket of priority ranges, in case there are too many priorities (e.g. priority based on a double) – Peter Jan 29 '14 at 16:45
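A sketch of how this could look with a fixed number of priority ranges, as per the comment above (the bucket count and boundaries are made up for illustration; workers would drain buckets[0] first, then buckets[1], and so on):

#include <array>
#include <vector>

struct Event { int importance; /* ... */ };

constexpr int num_buckets = 4;
std::array<std::vector<Event>, num_buckets> buckets;

// 2-3 extra instructions per incoming event, no sorting at all
void enqueue(const Event& e)
{
    // keys are typically 10..1000 (+ boost); map into [0, num_buckets)
    int b = e.importance / 300;
    if (b >= num_buckets) b = num_buckets - 1;
    buckets[num_buckets - 1 - b].push_back(e); // most important events land in buckets[0]
}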
3

Sounds like a nice example where near-sort algorithms can be useful.

Back a decade or so, Chazelle developed a nice data structure that works somewhat like a heap. The key difference is the time complexity: it has constant amortized time for all the important operations, e.g. insert, remove, find lowest element, etc.

The trick of this data structure is that it breaks the O(n log n) complexity barrier by allowing some error in the sort order.

To me that sounds pretty much what you need. The data-structure is called soft heap and explained on wikipedia:

https://en.wikipedia.org/wiki/Soft_heap

There are other algorithms that trade some error for speed as well. You'll find them if you google for near-sort algorithms.

If you try that algorithm, please give some feedback on how it works in practice. I'm really eager to hear how it performs.

Nils Pipenbrinck
  • Turned out that there's a simpler solution (see accepted answer) that doesn't need approximate sorting at all, but I'll keep this one in mind for when I really need an approx sort, excellent hint :-) – Damon Jan 29 '14 at 21:15
1

Sounds like you want to use std::partition: move the part that interests you to the front and the rest to the back. Its complexity is O(n), and it is cache-friendly, so it's probably a lot faster than sorting.
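A minimal sketch, assuming a hypothetical Event type and importance threshold (the workers then chew through the front range, and whatever is left past the returned iterator when the timer fires is exactly the unimportant remainder):

#include <algorithm>
#include <vector>

struct Event { int importance; /* ... */ };

// Move everything at least as important as `threshold` to the front.
// Returns the iterator separating the two groups; O(N), one sequential pass.
auto select_important(std::vector<Event>& events, int threshold)
{
    return std::partition(begin(events), end(events),
        [threshold](const Event& e) { return e.importance >= threshold; });
}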

Jan
  • A so-called 'approximate' sorting algorithm may use this as an implementation detail. For example, you can write a quicksort using it, and to make the quicksort approximate, restrict it to N recursion levels (a sketch follows after these comments). – David Jan 29 '14 at 16:30
  • +1 There is always one function in the standard library that you don't know about... :) I'll stick with Ronald's idea since it doesn't require moving stuff at all, but still, very interesting to know. – Damon Jan 29 '14 at 21:13
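A minimal sketch of the depth-limited quicksort David describes (Event/importance are placeholders; the comparator moves important events to the front, and the depth bound also caps the cost at a predictable O(max_depth * N)):

#include <algorithm>
#include <vector>

struct Event { int importance; /* ... */ };

// Quicksort that stops after max_depth levels of recursion: the vector ends
// up partitioned into at most 2^max_depth roughly ordered blocks.
void approx_sort(std::vector<Event>::iterator first,
                 std::vector<Event>::iterator last, int max_depth)
{
    if (max_depth == 0 || last - first < 2) return;
    auto pivot = first[(last - first) / 2].importance;
    auto mid = std::partition(first, last,
        [pivot](const Event& e) { return e.importance > pivot; }); // descending
    approx_sort(first, mid, max_depth - 1);
    approx_sort(mid, last, max_depth - 1);
}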
1

If you have limited "bandwidth" for processing events (say 128K per time quantum), you could use std::nth_element to select the 128K most promising events (minus some percentage lost to making that selection, and assuming you have an operator< that compares priorities) in O(N) time. Then you process those in parallel, and when you are done, you reprioritize the remainder (again in O(N) time).

#include <algorithm>
#include <vector>

std::vector<Event> events;
std::size_t const guaranteed_bandwidth = 1 << 17; // 128K is always possible to process
double const loss_from_nth_element = 0.95;        // illustrative: fraction left after paying for the selection

if (events.size() <= guaranteed_bandwidth) {
    // let all N workers loose on the [begin(events), end(events)) range
} else {
    auto const nth = static_cast<std::ptrdiff_t>(guaranteed_bandwidth * loss_from_nth_element);
    // nth_element takes three iterators; assumes operator< puts the most
    // promising events first
    std::nth_element(begin(events), begin(events) + nth, end(events));
    // let all N workers loose on the [begin(events), begin(events) + nth) range

    // reprioritize the [begin(events) + nth, end(events)) range and append
    // it to the queue for the next time quantum
}

This guarantees that in case your bandwidth threshold is reached, you process the most valuable elements first. You could even speed up the nth_element by a poor man's parallelization (e.g. let each of the N workers compute the M*128K/N best elements of its own chunk, for small M, in parallel, and then do a final merge and another nth_element on the M*128K survivors).
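A sketch of that parallelization under the same assumptions (K is the bandwidth, N the worker count, M the slack factor; the final merge is only outlined):

#include <algorithm>
#include <thread>
#include <vector>

struct Event { int importance; /* ... */ };

// Each of N workers pre-selects the best M*K/N elements of its own chunk;
// a final nth_element over the ~M*K survivors then picks the global best K.
void parallel_preselect(std::vector<Event>& events,
                        std::size_t K, std::size_t N, std::size_t M)
{
    auto by_importance = [](const Event& a, const Event& b) {
        return a.importance > b.importance; // best first
    };
    std::size_t const chunk = events.size() / N;
    std::size_t const keep  = M * K / N;

    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < N; ++w) {
        auto first = events.begin() + w * chunk;
        auto last  = (w + 1 == N) ? events.end() : first + chunk;
        workers.emplace_back([=] {
            if (static_cast<std::size_t>(last - first) > keep)
                std::nth_element(first, first + keep, last, by_importance);
        });
    }
    for (auto& t : workers) t.join();
    // gather the front `keep` elements of each chunk into a candidate buffer,
    // then run one more nth_element over those M*K candidates (omitted)
}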

The only weakness is that in case your system is really overloaded (billions of events, maybe due to some DoS attack) it could take more than the entire quantum to run nth_element (even when quasi-parallelized), and you actually process nothing. But if the processing time per event is much larger (say a few 1,000 cycles) than a priority comparison (say a dozen cycles), this should not happen under regular loads.

NOTE: for performance reasons, it's of course better to sort pointers/indices into the main event vector; this is not shown for brevity.

TemplateRex
  • +1 I had thought of `nth_element` initially but dismissed it as "not useful" since it really only guarantees the left half. So what about the rest... But of course, if one assumes the constraint "must process N per quantum" (not the case for me, but you could legitimately interpret my question that way), this is a great idea that would work perfectly (and a straightforward implementation). – Damon Jan 29 '14 at 21:30
  • @Damon note that I don't interpret as "must process N per quantum", but rather not let any resources go to waste and do most valuable events first. The problem with `std::partition` with an *absolute* priority threshold criterion is that you could have 90% of your data satisfy that criterion. This means that you will roll-over a lot of your data in case you get millions of events. The `nth_element` is a *relative* criterion that will guarantee that you do the best stuff first, not just the good enough stuff. – TemplateRex Jan 29 '14 at 21:36
  • Right, since it guarantees that everything in the left half is smaller than the pivot _and_ the left half is properly sorted. So the "best stuff" isn't just in the first "valuable" half, but really comes first, too. Just what one would want, really. – Damon Jan 29 '14 at 23:37
  • @Damon no, the left part is not sorted. If you also want that, there is `std::partial_sort`. – TemplateRex Jan 30 '14 at 08:18
0

If you have N worker threads, give each worker thread 1/Nth of the original unsorted array. The first thing each worker does is run your approximate fast sorting algorithm of choice on its individual piece of the array. Then they can each process their piece in order -- roughly performing higher-priority items first, while also being very cache friendly. This way, you don't take a hit for trying to sort the entire array, or even trying to approximately sort the entire array; and what little sorting there is, is entirely parallelized. Sorting 10 pieces individually is much cheaper than sorting the whole thing.

This would work best if the priorities of the items to process are randomly distributed. If there is some ordering to them, you'll wind up with one thread being flooded with, or starved of, high-priority items.
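A minimal sketch, using std::sort per chunk for simplicity (any of the approximate sorts from the question could be substituted; Event and process are placeholders, and the check against the timer is omitted):

#include <algorithm>
#include <thread>
#include <vector>

struct Event { int importance; /* ... */ };
void process(Event&);                       // per-event work, defined elsewhere

// Each worker sorts and then processes its own contiguous 1/N-th of the queue.
void run_quantum(std::vector<Event>& events, unsigned N)
{
    std::vector<std::thread> workers;
    std::size_t const chunk = events.size() / N;
    for (unsigned w = 0; w < N; ++w) {
        auto first = events.begin() + w * chunk;
        auto last  = (w + 1 == N) ? events.end() : first + chunk;
        workers.emplace_back([first, last] {
            std::sort(first, last, [](const Event& a, const Event& b) {
                return a.importance > b.importance;   // important items first
            });
            for (auto it = first; it != last; ++it)   // stop when the timer fires
                process(*it);
        });
    }
    for (auto& t : workers) t.join();
}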

David