Parallelizing radix sort for floating point numbers using pthread library in C

Question

I am trying to parallelize the radix sort using POSIX threads using C language. The specialty is the radix sort needs to be implemented for floating-point numbers. Currently, the code is running sequentially but I have no idea how to parallelize the code. Can anyone help me with this? Any help is appreciated.

What have you tried? What problems did you encounter? Note that you're not really asking a question. For reference, also read [ask]. — Ulrich Eckhardt, Aug 05 '21 at 18:04

score 0 · Answer 1 · answered Aug 05 '21 at 18:01

Radix sorts are pretty hard to parallelize efficiently on CPUs. There is two parts in a radix sort: the creation of the histogram and the bucket filling.

To create an histogram in parallel you can fill local histograms in each thread and then perform a (tree-based) reduction of the histograms to build a global one. This strategy scale well as long as the histogram are small relative to the data chunks computed by each thread. An alternative way to parallelize this step is to use atomic adds to fill directly a shared histogram. This last method scale barely when thread write accesses conflicts (which often happens on small histograms and many threads). Note that in both solutions, the input array is evenly distributed between threads.

Regarding the bucket filling part, one solution is to make use of atomic adds to fill the buckets: 1 atomic counter per bucket is needed so that each thread can push back items safely. This solution only scale when threads do not often access to the same bucket (bucket conflicts). This solution is not great as the scalability of the algorithm is strongly dependent of the content of the input array (sequential in the worst case). There are solutions to reduces conflicts between threads (better scalability) at the expense of more work (slower with few threads). One is to fill the buckets from both sides: threads with an even ID fill the buckets in ascending order while threads with an odd ID fill them in descending order. Note that it is important to take into account false sharing to maximize performance.

Atomic adds will be impacted when two or more threads access the same cache line, not just the same bucket. Cache coherency between separate cores running those threads is normally based on cache lines. To avoid this, the threads could generate local histograms, and then exit (all but one thread), then the local histograms could be added to form a single histogram. — rcgldr, Aug 09 '21 at 17:42
Indeed, that was the point of my last sentence (regarding the false sharing). One solution to reduce the effect of false sharing is to use padding between buckets. Another solution is to use randomization. State-of-the-art algorithms seems to use a tree-based atomic reduction with weakly-shared histograms (see OpenMP runtimes like IOMP). Sadly, none of the methods are really great. — Jérôme Richard, Aug 09 '21 at 19:03
As mentioned in my answer, for T threads, splitting up the array into T parts and generating T histograms, then converting them into T arrays of bucket starting indexes, at least gets that part parallelized. You could then use T/2 threads to sum the indexes, then T/4 threads to sum the results, ... , and a final single thread to produce the final array of indices. — rcgldr, Aug 10 '21 at 04:01

rcgldr · Answer 2 · 2021-08-10T19:02:24.750

A simple way to parallelize radix sort for all but the first pass is to use a most significant digit (MSD) pass to split up the array into bins, each of which can then be sorted concurrently. This approach relies on having somewhat uniform distribution of values, at least in terms of the most significant digit, so that the bins are reasonably equal in size.

For example, using a digit size of 8 bits (base 256), use a MSD pass to split up the array into 256 bins. Assuming there are t threads, then sort t bins at a time, using least significant digit first radix sort.

For larger arrays, it may help to use a larger initial digit size to split up the array into a larger number of bins, with the goal of getting t bins to fit in cache.

Link to a non-parallelized radix sort that uses MSD for first pass, then LSD for next 3 passes. The loop at the end of RadixSort() to sort the 256 bins could be parallelized:

Radix Sort Optimization

For the first pass, you could use the parallel method in Jerome Richard's answer, but depending on the data pattern, it may not help much, due to cache and memory conflicts.

Parallelizing radix sort for floating point numbers using pthread library in C

2 Answers2