vector instructions ("vcl" and "ume") for counting sort

Question

I'm trying vector instruction using libraries "vcl" and "ume" for a kind of counting sort, which gives only the position back

// icpc sort.cpp -xCORE_AVX2 -o c
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include "vcl/vectorclass.h"
#include "umesimd/UMESimd.h"
using namespace UME::SIMD;

int main(void) {
    //scalar version as example __________________________________________
    int s0[4], k0 = 0, x0[4] = { 9,
                                 1,
                                 2,
                                 1};
    for (int i = 9; i >= 0; i--)
        for (int j = 0; j < 4; j++)
            if (x0[j] == i) {
                s0[k0] = j;
                k0++; }
    for (int j = 0; j < 4; j++) printf("%i\n", s0[j]); printf("\n");

    int a[32] = {9, 1,  5,  2,  1,  6,  3,  5,
                1,  4,  0,  3,  4,  7,  1,  1,
                2,  2,  1,  4,  4,  7,  2,  5,
                1,  4,  0,  1,  6,  4,  6,  5};
    //vcl version _____________________________________________________________________
    Vec8i s1[4] = 0, k1 = 0, x1[4]; Vec8ib msk1;
    x1[0].load(a);  x1[1].load(a + 8);  x1[2].load(a + 16); x1[3].load(a + 24);
    for (int i = 0; i < 4; i++) { for (int j = 0; j < 8; j++) printf("%i\t", x1[i].extract(j)); printf("\n"); } printf("\n");
    for (int i = 9; i >= 0; i--)
        for (int j = 0; j < 4; j++) {
            msk1 = (x1[j] == i);
            //s1[k1] = select(msk1, j, s1);
            k1 = select(msk1, k1 + 1, k1); }
    for (int i = 0; i < 4; i++) { for (int j = 0; j < 8; j++) printf("%i\t", s1[i].extract(j)); printf("\n"); } printf("\n");

    //ume version ______________________________________________________________________
    SIMD8_32i s2[4] = 0, k2 = 0, x2[4]; SIMDMask8 msk2;
    x2[0].load(a); x2[1].load(a + 8); x2[2].load(a + 16); x2[3].load(a + 24);
    for (int i = 0; i < 4; i++) { for (int j = 0; j < 8; j++) printf("%i\t", x2[i].extract(j)); printf("\n"); } printf("\n");
    for (int i = 9; i >= 0; i--)
        for (int j = 0; j < 4; j++) {
            msk2 = x2[j].cmpeq(i);
            //s2[k2].assign(msk2, j);
            k2.assign(msk2, k2 + 1); }
    for (int i = 0; i < 4; i++) { for (int j = 0; j < 8; j++) printf("%i\t", s2[i].extract(j)); printf("\n"); } printf("\n");

    return 0;
}

unfortunately I cannot find the solution for any of the libraries and I need help to solve it (commented lines)

Peter Cordes · Accepted Answer · 2019-06-08T17:36:18.260

0

The hard part of Counting Sort is the histogram problem.

Histograms are hard for SIMD because you require gather loads and scatter stores, and detection of conflicts (where your vector of input elements has duplicates, so you need to increment the same bucket multiple times). x86 only has this with AVX512.

Without gather/scatter/conflict-detection, there are scalar tricks that are useful for small numbers of buckets. e.g. replicating the count array 4 times to mitigate store/reload store-forwarding bottlenecks if the same input number appears makes up most of the input for a while. (With out of order execution, it doesn't have to be 100% contiguous to cause a problem.)

You can use SIMD to sum up the bucket arrays at the end, e.g. _mm256_add_epi32 for 32-bit counters. There's some scope for manual SIMD optimization of the memset / std::fill part of producing the sorted output, but usually this isn't the slow part.

Further reading:

Methods to vectorise histogram in SIMD? shows the array-of-multiple-tables trick.
How to speed up this histogram of LUT lookups? shows code implementing this, which gcc auto-vectorizes with AVX2. And discussion of AVX512. (But note that this is histogramming the result of a LUT lookup, so there's an extra layer of indirection which we can SIMD with AVX2 gathers. That's not the case for your Counting Sort)
How do the Conflict Detection instructions make it easier to vectorize loops? AVX512CD for histograms

edited Jun 08 '19 at 17:36

answered Jun 08 '19 at 08:28

Peter Cordes

328,167
45
605
847

thank you for the clarification. I'm a simply engineer, not a programmer and it took me a while to understand the answer. Anyway, I used just the beginning of the count-sort, or better said just the first idea until positions are found, maybe your remarks on histogram problem are coming with the second part. The bottlenecks of my code is in this "stable" sort, this is why I wanted to "simd" it. If you have link on that "scalar tricks" for small array, please be so kind an let me know. – mimi Jun 08 '19 at 12:45
@mimi: added links to existing SO Q&As. – Peter Cordes Jun 08 '19 at 17:36
thank you for the hint. Reading more about the subject I switched to a Dr Dobbs insertion sort and I got a speedup of 40%. I found also [parallel counting sort](http://www.drdobbs.com/architecture-and-design/parallel-counting-sort/224700144), but I think a vector instructions Insertion Sort, it will be better. If you know any helpful material about it, please be so kind and let me know. – mimi Jun 09 '19 at 13:55
@mimi: for *large* arrays with a small number of possible inputs (like uint8_t or smaller), Counting Sort should be excellent; it's an O(n) sort and yes you can multithread it (with each thread having a private array of counts). And you could multi-thread the output phase with prefix sums to figure out where to start a chunk. – Peter Cordes Jun 09 '19 at 14:29
@mimi: **For short inputs with SIMD, the state of the art is usually SIMD sorting networks** with shuffles and SSE4.1 `_mm_min_epi32` / `_mm_max_epi32` or the AVX2 / AVX512 wider versions. Or for float, `minps`/maxps`. Either way use `shufps` even for integer. [Very fast sorting of fixed length arrays using comparator networks](//stackoverflow.com/q/19790522) shows some pure C++ sorting networks, and see [Find 4 minimal values in 4 \_\_m256d registers](//stackoverflow.com/q/35945228) for some links to x86 SIMD sorting. – Peter Cordes Jun 09 '19 at 14:35
yes I have short inputs - many arrays with 8-16 elements between 0 and 9 (because I "normalized" them). You put me on the right track and now, theoretically I have all the ingredients (I am very close to the roof line L1 cache and I will change the code to go more on the right). We can consider this topic solved. Thank you ! – mimi Jun 09 '19 at 20:16

vector instructions ("vcl" and "ume") for counting sort

1 Answers1