
I was trying to optimize the Radix Sort code, because I felt there was room for improvement: the traditional implementations in books and on the web seem to be direct copies of one another, and they also run very slowly because they take an arbitrary base such as 10 for the modulo operation. I have optimized the code as far as I could; maybe I have missed some optimization techniques. In that case, please enlighten me.

Motivation for optimization:
http://codercorner.com/RadixSortRevisited.htm
http://stereopsis.com/radix.html
I was unable to implement all the optimizations in those articles, mostly because they were beyond my skills and understanding and I lacked sufficient time; if you can, feel free to implement them.

EDIT 4
This Java version of Radix Sort calculates all four histograms in one read pass, does not need to refill array Z with zeros after every LSB sort, and keeps the usual ability to skip a pass and jump to the next LSB sort if the corresponding byte is the same in all elements. As usual, this is only for 32-bit integers, but a 64-bit version can be created from it.

protected static int[] DSC(int A[])// Sorts non-negative 32-bit integers in descending order (negative values are handled from EDIT 5)
{
    int tmp[] = new int[A.length] ;
    int Z[] = new int[1024] ;
    int i, Jump, Jump2, Jump3, Jump4, swap[] ;

    Jump = A[0] & 255 ;
    Z[Jump] = 1 ;
    Jump2 = ((A[0] >> 8) & 255) + 256 ;
    Z[Jump2] = 1 ;
    Jump3 = ((A[0] >> 16) & 255) + 512 ;
    Z[Jump3] = 1 ;
    Jump4 = (A[0] >> 24) + 768 ;
    Z[Jump4] = 1 ;

    // Histograms creation
    for (i = 1 ; i < A.length; ++i)
    {
        ++Z[A[i] & 255] ;
        ++Z[((A[i] >> 8) & 255) + 256] ;
        ++Z[((A[i] >> 16) & 255) + 512] ;
        ++Z[(A[i] >> 24) + 768] ;
    }

    // 1st LSB Byte Sort
    if( Z[Jump] != A.length )
    {
        Z[0] = A.length - Z[0];
        for (i = 1; i < 256; ++i)
        {
            Z[i] = Z[i - 1] - Z[i];
        }
        for (i = 0; i < A.length; ++i)
        {
            tmp[Z[A[i] & 255]++] = A[i];
        }
        swap = A ; A = tmp ; tmp = swap ;
    }

    // 2nd LSB Byte Sort
    if( Z[Jump2] != A.length )
    {
        Z[256] = A.length - Z[256];
        for (i = 257; i < 512; ++i)
        {
            Z[i] = Z[i - 1] - Z[i];
        }
        for (i = 0; i < A.length; ++i)
        {
            tmp[Z[((A[i] >> 8) & 255) + 256]++] = A[i];
        }
        swap = A ; A = tmp ; tmp = swap ;
    }

    // 3rd LSB Byte Sort
    if( Z[Jump3] != A.length )
    {
        Z[512] = A.length - Z[512];
        for (i = 513; i < 768; ++i)
        {
            Z[i] = Z[i - 1] - Z[i];
        }
        for (i = 0; i < A.length; ++i)
        {
            tmp[Z[((A[i] >> 16) & 255) + 512]++] = A[i];
        }
        swap = A ; A = tmp ; tmp = swap ;
    }

    // 4th LSB Byte Sort
    if( Z[Jump4] != A.length )
    {
        Z[768] = A.length - Z[768];
        for (i = 769; i < Z.length; ++i)
        {
            Z[i] = Z[i - 1] - Z[i];
        }
        for (i = 0; i < A.length; ++i)
        {
            tmp[Z[(A[i] >> 24) + 768]++] = A[i];
        }
        return tmp ;
    }
    return A ;
}

The Java version ran faster with the != comparison than with ==:

if( Z[Jump] != A.length )
{
    // lines of code
}...

but in C, the version below (using ==) was on average 25% faster than its counterpart using !=. Your hardware might react differently.

if( Z[Jump] == A.length );
else
{
    // lines of code
}...

Below is the C code ( "long" on my machine is 32 bits )

long* Radix_2_ac_long(long *A, size_t N, long *Temp)// Sorts non-negative values in ascending order (negatives are handled in EDIT 5)
{
    size_t Z[1024] = {0};
    long *swp;
    size_t i, Jump, Jump2, Jump3, Jump4;

    // Short-circuit set-up
    Jump = *A & 255;
    Z[Jump] = 1;
    Jump2 = ((*A >> 8) & 255) + 256;
    Z[Jump2] = 1;
    Jump3 = ((*A >> 16) & 255) + 512;
    Z[Jump3] = 1;
    Jump4 = (*A >> 24) + 768;
    Z[Jump4] = 1;

    // Histograms creation
    for(i = 1 ; i < N ; ++i)
    {
        ++Z[*(A+i) & 255];
        ++Z[((*(A+i) >> 8) & 255) + 256];
        ++Z[((*(A+i) >> 16) & 255) + 512];
        ++Z[(*(A+i) >> 24) + 768];
    }

    // 1st LSB byte sort
    if( Z[Jump] == N );
    else
    {
        for( i = 1 ; i < 256 ; ++i )
        {
            Z[i] = Z[i-1] + Z[i];
        }
        for( i = N-1 ; i < N ; --i )// size_t i wraps around below zero, so i < N ends the loop
        {
            *(--Z[*(A+i) & 255] + Temp) = *(A+i);
        }
        swp = A;
        A = Temp;
        Temp = swp;
    }

    // 2nd LSB byte sort
    if( Z[Jump2] == N );
    else
    {
        for( i = 257 ; i < 512 ; ++i )
        {
            Z[i] = Z[i-1] + Z[i];
        }
        for( i = N-1 ; i < N ; --i )
        {
            *(--Z[((*(A+i) >> 8) & 255) + 256] + Temp) = *(A+i);
        }
        swp = A;
        A = Temp;
        Temp = swp;
    }

    // 3rd LSB byte sort
    if( Z[Jump3] == N );
    else
    {
        for( i = 513 ; i < 768 ; ++i )
        {
            Z[i] = Z[i-1] + Z[i];
        }
        for( i = N-1 ; i < N ; --i )
        {
            *(--Z[((*(A+i) >> 16) & 255) + 512] + Temp) = *(A+i);
        }
        swp = A;
        A = Temp;
        Temp = swp;
    }

    // 4th LSB byte sort
    if( Z[Jump4] == N );
    else
    {
        for( i = 769 ; i < 1024 ; ++i )
        {
            Z[i] = Z[i-1] + Z[i];
        }
        for( i = N-1 ; i < N ; --i )
        {
            *(--Z[(*(A+i) >> 24) + 768] + Temp) = *(A+i);
        }
        return Temp;
    }
    return A;
}

EDIT 5
The sort now handles negative numbers too. Only minor tweaks to the code were needed: the prefix sums for the top-byte histogram are built so that the buckets for negative values (top byte 128 to 255) come before the buckets for non-negative values (top byte 0 to 127). It runs a little slower as a result, but the effect is not significant. Coded in C, below ( "long" on my system is 32 bits )

long* Radix_Sort(long *A, size_t N, long *Temp)
{
    size_t Z[1024] = {0};
    long *swp;
    size_t Jump, Jump2, Jump3, Jump4;
    long i;

    // Short-circuit set-up
    Jump = *A & 255;
    Z[Jump] = 1;
    Jump2 = ((*A >> 8) & 255) + 256;
    Z[Jump2] = 1;
    Jump3 = ((*A >> 16) & 255) + 512;
    Z[Jump3] = 1;
    Jump4 = ((*A >> 24) & 255) + 768;
    Z[Jump4] = 1;

    // Histograms creation
    for(i = 1 ; i < N ; ++i)
    {
        ++Z[*(A+i) & 255];
        ++Z[((*(A+i) >> 8) & 255) + 256];
        ++Z[((*(A+i) >> 16) & 255) + 512];
        ++Z[((*(A+i) >> 24) & 255) + 768];
    }

    // 1st LSB byte sort
    if( Z[Jump] == N );
    else
    {
        for( i = 1 ; i < 256 ; ++i )
        {
            Z[i] = Z[i-1] + Z[i];
        }
        for( i = N-1 ; i >= 0 ; --i )
        {
            *(--Z[*(A+i) & 255] + Temp) = *(A+i);
        }
        swp = A;
        A = Temp;
        Temp = swp;
    }

    // 2nd LSB byte sort
    if( Z[Jump2] == N );
    else
    {
        for( i = 257 ; i < 512 ; ++i )
        {
            Z[i] = Z[i-1] + Z[i];
        }
        for( i = N-1 ; i >= 0 ; --i )
        {
            *(--Z[((*(A+i) >> 8) & 255) + 256] + Temp) = *(A+i);
        }
        swp = A;
        A = Temp;
        Temp = swp;
    }

    // 3rd LSB byte sort
    if( Z[Jump3] == N );
    else
    {
        for( i = 513 ; i < 768 ; ++i )
        {
            Z[i] = Z[i-1] + Z[i];
        }
        for( i = N-1 ; i >= 0 ; --i )
        {
            *(--Z[((*(A+i) >> 16) & 255) + 512] + Temp) = *(A+i);
        }
        swp = A;
        A = Temp;
        Temp = swp;
    }

    // 4th LSB byte sort and negative numbers sort
    if( Z[Jump4] == N );
    else
    {
        for( i = 897 ; i < 1024 ; ++i )// negative values occupy indices 896 to 1023 ( 896 = 768 + 128, i.e. top bytes 128 to 255 )
        {
            Z[i] = Z[i-1] + Z[i];
        }
        Z[768] = Z[768] + Z[1023];
        for( i = 769 ; i < 896 ; ++i )
        {
            Z[i] = Z[i-1] + Z[i];
        }
        for( i = N-1 ; i >= 0 ; --i )
        {
            *(--Z[((*(A+i) >> 24) & 255) + 768] + Temp) = *(A+i);
        }
        return Temp;
    }
    return A;
}

EDIT 6
Below is the pointer-optimized version (it accesses array locations via pointers), which takes, on average, approximately 20% less time to sort than the one above. It also uses 4 separate histogram arrays for faster address calculation ( "long" on my system is 32 bits ).

long* Radix_Sort(long *A, size_t N, long *Temp)
{
    long Z1[256] ;
    long Z2[256] ;
    long Z3[256] ;
    long Z4[256] ;
    long T = 0 ;
    while(T != 256)
    {
        *(Z1+T) = 0 ;
        *(Z2+T) = 0 ;
        *(Z3+T) = 0 ;
        *(Z4+T) = 0 ;
        ++T;
    }
    size_t Jump, Jump2, Jump3, Jump4;

    // Short-circuit set-up
    Jump = *A & 255 ;
    Z1[Jump] = 1;
    Jump2 = (*A >> 8) & 255 ;
    Z2[Jump2] = 1;
    Jump3 = (*A >> 16) & 255 ;
    Z3[Jump3] = 1;
    Jump4 = (*A >> 24) & 255 ;
    Z4[Jump4] = 1;

    // Histograms creation
    long *swp = A + N;
    long *i = A + 1;
    for( ; i != swp ; ++i)
    {
        ++Z1[*i & 255];
        ++Z2[(*i >> 8) & 255];
        ++Z3[(*i >> 16) & 255];
        ++Z4[(*i >> 24) & 255];
    }

    // 1st LSB byte sort
    if( Z1[Jump] == N );
    else
    {
        swp = Z1+256 ;
        for( i = Z1+1 ; i != swp ; ++i )
        {
            *i = *(i-1) + *i;
        }
        swp = A-1;
        for( i = A+N-1 ; i != swp ; --i )
        {
            *(--Z1[*i & 255] + Temp) = *i;
        }
        swp = A;
        A = Temp;
        Temp = swp;
    }

    // 2nd LSB byte sort
    if( Z2[Jump2] == N );
    else
    {
        swp = Z2+256 ;
        for( i = Z2+1 ; i != swp ; ++i )
        {
            *i = *(i-1) + *i;
        }
        swp = A-1;
        for( i = A+N-1 ; i != swp ; --i )
        {
            *(--Z2[(*i >> 8) & 255] + Temp) = *i;
        }
        swp = A;
        A = Temp;
        Temp = swp;
    }

    // 3rd LSB byte sort
    if( Z3[Jump3] == N );
    else
    {
        swp = Z3 + 256 ;
        for( i = Z3+1 ; i != swp ; ++i )
        {
            *i = *(i-1) + *i;
        }
        swp = A-1;
        for( i = A+N-1 ; i != swp ; --i )
        {
            *(--Z3[(*i >> 16) & 255] + Temp) = *i;
        }
        swp = A;
        A = Temp;
        Temp = swp;
    }

    // 4th LSB byte sort and negative numbers sort
    if( Z4[Jump4] == N );
    else
    {
        swp = Z4 + 256 ;
        for( i = Z4+129 ; i != swp ; ++i )
        {
            *i = *(i-1) + *i;
        }
        *Z4 = *Z4 + *(Z4+255) ;
        swp = Z4 + 128 ;
        for( i = Z4+1 ; i != swp ; ++i )
        {
            *i = *(i-1) + *i;
        }
        swp = A - 1;
        for( i = A+N-1 ; i != swp ; --i )
        {
            *(--Z4[(*i >> 24) & 255] + Temp) = *i;
        }
        return Temp;
    }
    return A;
}
ytoamn
  • This might be more appropriate for codereview SE. – KevinO Apr 29 '17 at 22:31
  • @KevinO thanks, would be nice if anyone here could help – ytoamn Apr 29 '17 at 22:37
  • 1
    It seems likely that some radix values are more common than others. I'd suggest special casing powers of 2 to get bitwise ops instead of division. – aghast Apr 29 '17 at 22:57
  • A fast radix sort is LSB, usually base 256 (8 bits). A matrix for the counts that get converted into indices can be used so a single pass over the original array sets the matrix to be used for all radix sort passes. For 32 bit integers, the matrix is [4][256], for 64 bit integers, [8][256]. To deal with positive and negative numbers, assuming the most significant byte is signed (-128 to +127), add a bias of +128 (to end up with the range (0 to 255) for indexing. – rcgldr Apr 29 '17 at 23:04
  • If the array is large enough, then using mostly base 2048 (11 bits) and one base 1024 (10 bits) (for 32 bit integers) or one base 512 (for 64 bit integers) may be a bit faster. If the array is larger still, then base 65536 (16 bits) may be a bit faster.. Although a larger base means fewer radix sort passes, cache becomes an issue. In my C/C++ benchmarks it doesn't help much, less than 10% – rcgldr Apr 29 '17 at 23:06
  • 1
    Build several partial histograms and merge them, building just one creates a lot of unpredictable through-memory dependencies. – harold Apr 30 '17 at 11:23
  • 3
    % and / are possibly the slowest operations, you should definitely use bit shifting and masking instead (and do not allow arbitrary base). – Antonín Lejsek Apr 30 '17 at 14:16
  • @AustinHastings really thanks a lot for them, I really didnt think about that, thats giving me a nice gain in speed, +1. Have updated the code and the question. If you can, please have a look at the updated section on bitwise AND operation. – ytoamn Apr 30 '17 at 15:51
  • @AntonínLejsek really thanks a lot for them, I really didnt think about that, thats giving me a nice gain in speed, +1. Have updated the code and the question. If you can, please have a look at the updated section on bitwise AND operation. – ytoamn Apr 30 '17 at 15:52
  • @harold could you please elaborate a little more or point me to some web resources as I dont seem have any clue / understanding on that – ytoamn Apr 30 '17 at 16:08
  • @ytoamn see [this](https://docs.google.com/document/d/18gs0bkEwQ5cO8pMXT_MsOa8Xey4NEavXq-OvtdUXKck/pub) and [this](http://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/) but in Java you can't go that far, just doing multiple partial histograms at once is about all you can do but that's also already quite effective – harold Apr 30 '17 at 16:21
  • @rcgldr my integers are 32 bits, so I will stick to 256 as base, but 1 question, how do I get to the 2nd LSB, 3rd and 4th LSB ? I think I have to bitshift to right by 8 bits in every radix pass, that means chopping off an LSB byte or dividing by base in every radix sort pass. Is there another alternative to for those LSBs ? – ytoamn Apr 30 '17 at 20:10
  • @ytoamn - Assuming `e` is an integer from the array to be sorted, for the LSB, use `e & 0xff`, 2nd LSB use `(e>>8) & 0xff`, 3rd LSB `(e>>16) & 0xff`, 4th LSB `((e>>24)+0x80) & 0xff` . This will be needed for indexing when generating the matrix of counts, and again when doing the radix sort. – rcgldr Apr 30 '17 at 20:15
  • @rcgldr what if I store the byte shifted results in an array, I could operate on them iteratively, though that will consume memory it may improve speed, no idea, and then once the indices are available I will need to sort the array 4 times ( using indices of each pass ) to get the result. I hope i follow you right. Also if the radix is 1 i.e. all the elements are single digits, do you think I need to go till the 4th LSB byte or even the 3rd or 2nd LSB bytes? as the maximum keyboard digit that can be entered is 9. well within the right 8 bits – ytoamn Apr 30 '17 at 20:26
  • @rcgldr Also instead of using 8 bits base for 32 bit integers, what about a base of 16 bits, and why not an arbitrary base i.e. 9 bits or 11 bits – ytoamn Apr 30 '17 at 20:42
  • @ytoamn - assuming 32 bit mode, with a 16 bit base, the array of indices takes 65536*4 = 256KB, greater than the typical 32KB per core L1 cache, and using all of the 256KB per core L2 cache. It will fit in the L3 cache though. The number of passes is reduced from 4 to 2, but the cache overhead offsets the gain, so there won't be much difference. Also you need an array large and random enough to fill in most of the [2][65536] matrix of indices, or otherwise it will be slower. I think 2^26 or about 67 million elements was the break even point. – rcgldr Apr 30 '17 at 23:24
  • @ytoamn - For 32 bit integers, you could use bit fields of 11, 11, and 10 bits for 3 passes, but then a copy pass is needed. Note that most of the time spent with a radix sort is the random access writes, assuming that the array is much larger than L3 and L4 (if present) cache, (L3 and L4 cache won't be impacted by random access writes). As for a max value check, this reduces the number of passes, but any "extra" passes are sequential copies (sequential read / sequential write), which are fast compared to the random access writes. If the array fits inside the cache, then the max check helps. – rcgldr Apr 30 '17 at 23:28
  • @AntonínLejsek I thought defining variables locally leads to wastage of time as time is spent in their memory allocation – ytoamn May 01 '17 at 21:22
  • @AntonínLejsek I updated the code in question, thanks a lot, your code runs really fast – ytoamn May 01 '17 at 21:31
  • @AntonínLejsek as for the original array being changed, could you elaborate "is some cases and in other not." I think the function in each iteration swaps the partially sorted array to A, so after the final swap A has to contain the sorted array, there exist no other possibility, I think – ytoamn May 01 '17 at 21:41
  • @harold I have created a version of Radix Sort that creates multiple histograms in 1 read pass, like as you said in earlier comments, If possible, would you please see it if any more improvement can be done or have I diverged from what you said. The code is in the EDIT 4 section of the question. – ytoamn May 06 '17 at 21:21
  • That's an interesting way to apply the technique, I hadn't even thought of that yet. What I would have done, and I'm not saying it's better it's just that it's something in my "bag of tricks", is construct one histogram by first building 4 (or so, maybe some other number) partial histograms (each counting every 4th item, with different offsets) and summing them. The way you use the technique avoids the summing, so that's nice. Have you tried it with separate arrays too, to avoid the `+768` etc? I can't really predict in advance how that works out. – harold May 07 '17 at 00:43
  • @harold the separate arrays also work fine but there seems to be no clear winner among those 2 variants, almost half the time the separate array version wins when array size is big, but wins by only few microseconds and loses to the other version when array size becomes small. I might need more tests to be sure. Umm this may be too much to ask but here goes, can you please upload the code for radix sort you are talking about ( the 1 making single histogram from 4 ), it would be of immense help as I would be able to study it directly. Right now I dont know of any methods how to build 1 from 4 – ytoamn May 07 '17 at 14:51
  • @harold please tell me how do I calculate one histogram from 4 partial histograms, i cant figure out that by myself – ytoamn May 09 '17 at 07:23
  • @ytoamn right sorry, well you can unroll the counting loop by, say, 4 and then count the 0th item in the 0th histogram, the 1st item in the 1th histogram and so on (there may be up to 3 things left to count outside the main loop of course). Then the histogram of everything is just the sums of the histograms. – harold May 09 '17 at 07:33
  • @harold I still cant make any sense of it, sorry, do you have a code snippet, or any site, please, sorry again for bothering so much – ytoamn May 10 '17 at 06:31
  • @ytoamn like [this](https://pastebin.com/sAyHdv9i), approximately. That's with C# bytes and for length a multiple of 4. – harold May 10 '17 at 06:56
  • @AntonínLejsek, Something big has happened, feel free to check out the EDIT 5 section and share your views :) – ytoamn Jul 06 '17 at 20:06
  • @ytoamn well did you implement it correctly? Splitting it this way avoids the big latency of a through-memory dependency and also (mostly?) the associated memory dependence misspeculation. – harold Jul 16 '17 at 14:50
  • @ytoamn it has been faster in every test I've ever done, and I also refer you to the more professional testers here: http://fastcompression.blogspot.nl/2014/09/counting-bytes-fast-little-trick-from.html the through-memory dependence is the effect where one increment depends on a previous increment and it does so through memory (because we're incrementing array elements here, not registers). That makes the latency through dependent increments higher than usual, and it also has to be predicted what they depend on. – harold Jul 16 '17 at 14:57
  • @harold but why use 4 arrays ( h0, h1, h2, h3 ) instead of 6 or more – ytoamn Jul 16 '17 at 15:13
  • @ytoamn you could use more, but it doesn't help as much, at least not in my tests. The biggest difference is going from 1 to 2, then going to 3 and 4 helps less, 5+ helps even less - at some point it should get slower but I'm not sure where that point is, actually it would depend on the data so it's hard to generalize.. – harold Jul 16 '17 at 15:20
  • @harold I think it might not be needed, have a look at EDIT 6 , the code calculates the subsequent LSBs lying left of the current LSB ( each LSB being 8 bits ) , so thereby counting other LSBs when latency time of first LSB hasn't ended, and it also works for any size of array – ytoamn Jul 16 '17 at 16:16
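For illustration, here is a minimal Java sketch of the partial-histogram counting harold describes in the comments above: the counting loop is unrolled by four, each lane counts into its own histogram, and the partial histograms are summed at the end. This is a sketch of the idea only, not the code behind the pastebin link; the method name is hypothetical.

// Illustrative sketch (hypothetical method, not the pastebin code): four
// partial histograms of the least significant byte, merged at the end, so
// consecutive increments do not depend on each other through memory.
static int[] histogramLSB(int A[])
{
    int h0[] = new int[256], h1[] = new int[256];
    int h2[] = new int[256], h3[] = new int[256];
    int Z[] = new int[256];
    int i = 0;

    for ( ; i + 4 <= A.length ; i += 4)     // unrolled by 4, one histogram per lane
    {
        ++h0[A[i] & 255];
        ++h1[A[i + 1] & 255];
        ++h2[A[i + 2] & 255];
        ++h3[A[i + 3] & 255];
    }
    for ( ; i < A.length ; ++i)             // up to 3 leftover elements
    {
        ++h0[A[i] & 255];
    }
    for (i = 0; i < 256; ++i)               // the full histogram is the sum of the partials
    {
        Z[i] = h0[i] + h1[i] + h2[i] + h3[i];
    }
    return Z;
}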

2 Answers


The edit 4 version is good enough if the original and temp arrays fit in cache. If the array size is much greater than the cache size, most of the overhead is due to the random-order writes to the arrays. A hybrid msb/lsb radix sort can avoid this issue: for example, split the array into 256 bins according to the most significant byte, then do an lsb radix sort on each of the 256 bins. The idea here is that a pair of bins (original and temp) will fit within the cache, where random-order writes are not an issue (for most cache implementations).

For an 8MB cache, the goal is for each of the bins to be < 4MB in size = 1 million 32 bit integers, if the integers distribute evenly into the bins. This strategy would work for array sizes up to 256 million 32 bit integers. For larger arrays, the msb phase could split the array into 1024 bins, for up to 1 billion 32 bit integers. On my system, sorting 16,777,216 (2^24) 32 bit integers with a classic 8,8,8,8 lsb radix sort took 0.45 seconds, while the hybrid 8 msb : 8,8,8 lsb sort took 0.24 seconds.

#include <algorithm>    // std::swap
#include <cstddef>      // size_t
#include <cstdint>      // uint32_t

void RadixSort3(uint32_t * a, uint32_t *b, size_t count);   // forward declaration (defined below)

// split array into 256 bins according to most significant byte
void RadixSort(uint32_t * a, size_t count)
{
size_t aIndex[260] = {0};               // count / array
uint32_t * b = new uint32_t [count];    // allocate temp array
size_t i;
    for(i = 0; i < count; i++)          // generate histogram
        aIndex[1+((size_t)(a[i] >> 24))]++;
    for(i = 2; i < 257; i++)            // convert to indices
        aIndex[i] += aIndex[i-1];
    for(i = 0; i < count; i++)          //  sort by msb
        b[aIndex[a[i]>>24]++] = a[i];
    for(i = 256; i; i--)                // restore aIndex
        aIndex[i] = aIndex[i-1];
    aIndex[0] = 0;
    for(i = 0; i < 256; i++)            // radix sort the 256 bins
        RadixSort3(&b[aIndex[i]], &a[aIndex[i]], aIndex[i+1]-aIndex[i]);
    delete[] b;
}

// sort a bin by 3 least significant bytes
void RadixSort3(uint32_t * a, uint32_t *b, size_t count)
{
size_t mIndex[3][256] = {0};            // count / matrix
size_t i,j,m,n;
uint32_t u;
    if(count == 0)
        return;
    for(i = 0; i < count; i++){         // generate histograms
        u = a[i];
        for(j = 0; j < 3; j++){
            mIndex[j][(size_t)(u & 0xff)]++;
            u >>= 8;
        }       
    }
    for(j = 0; j < 3; j++){             // convert to indices
        m = 0;
        for(i = 0; i < 256; i++){
            n = mIndex[j][i];
            mIndex[j][i] = m;
            m += n;
        }       
    }
    for(j = 0; j < 3; j++){             // radix sort
        for(i = 0; i < count; i++){     //  sort by current lsb
            u = a[i];
            m = (size_t)(u>>(j<<3))&0xff;
            b[mIndex[j][m]++] = u;
        }
        std::swap(a, b);                //  swap ptrs
    }
}

Example code for classic lsb radix sorts:

Example C++ lsb radix sort using 8,8,8,8 bit fields:

typedef unsigned int uint32_t;

void RadixSort(uint32_t * a, size_t count)
{
size_t mIndex[4][256] = {0};            // count / index matrix
uint32_t * b = new uint32_t [count];    // allocate temp array
size_t i,j,m,n;
uint32_t u;
    for(i = 0; i < count; i++){         // generate histograms
        u = a[i];
        for(j = 0; j < 4; j++){
            mIndex[j][(size_t)(u & 0xff)]++;
            u >>= 8;
        }       
    }
    for(j = 0; j < 4; j++){             // convert to indices
        m = 0;
        for(i = 0; i < 256; i++){
            n = mIndex[j][i];
            mIndex[j][i] = m;
            m += n;
        }       
    }
    for(j = 0; j < 4; j++){             // radix sort
        for(i = 0; i < count; i++){     //  sort by current lsb
            u = a[i];
            m = (size_t)(u>>(j<<3))&0xff;
            b[mIndex[j][m]++] = u;
        }
        std::swap(a, b);                //  swap ptrs
    }
    delete[] b;
}

Example C++ code using 16,16 bit fields:

typedef unsigned int uint32_t;

uint32_t * RadixSort(uint32_t * a, size_t count)
{
size_t mIndex[2][65536] = {0};          // count / index matrix
uint32_t * b = new uint32_t [count];    // allocate temp array
size_t i,j,m,n;
uint32_t u;
    for(i = 0; i < count; i++){         // generate histograms
        u = a[i];
        for(j = 0; j < 2; j++){
            mIndex[j][(size_t)(u & 0xffff)]++;
            u >>= 16;
        }
    }
    for(j = 0; j < 2; j++){             // convert to indices
        m = 0;
        for(i = 0; i < 65536; i++){
            n = mIndex[j][i];
            mIndex[j][i] = m;
            m += n;
        }
    }       
    for(j = 0; j < 2; j++){             // radix sort
        for(i = 0; i < count; i++){     //  sort by current lsb
            u = a[i];
            m = (size_t)(u>>(j<<4))&0xffff;
            b[mIndex[j][m]++] = u;
        }
        std::swap(a, b);                //  swap ptrs
    }
    delete[] b;
    return(a);
}
rcgldr
  • Thanks for the info, I tried benchmarking 11,11,10 method, I don't know why it turned out to be slower than **EDIT4**, same was 16 bit array, it could be as you say they are faster for larger arrays (> 100000), feel free to upload your code, will check it out – ytoamn Jul 01 '17 at 06:50
  • yeah C / C++ and java all 3 are welcome, if you have ample time then feel free to upload all 3 :) – ytoamn Jul 02 '17 at 07:58
  • Thanks a lot, to me it seems like there is some scope for few micro-optimizations, and "COUNT" means "count" – ytoamn Jul 03 '17 at 20:58
  • @ytoamn - I tried your edit4 version, and it was about 2% faster than the simple 8,8,8,8 version in my answer, reducing the time from 0.450 to 0.441 seconds. This is what I meant by unfolding / optimizing the `j` loops not making much difference. Most of the time spent in radix sort is due to the random location of the writes to the array during each radix sort pass. In the case of 16 million 32 bit integers, that's 64MB of data written in random order, with 8MB cache on my system. As array size increases, the `j` optimizations have less relative effect. – rcgldr Jul 04 '17 at 19:28
  • Its really nice to know all that, thanks a lot :) , the nanoscopic slowdown would probably be due to setting up the short-circuit-path, which can skip the random memory writes of sorting step if all tested LSBs are same. Thats something which I have never seen earlier in books or the web or being taught anywhere. Another step closer to the state-of-the-art radix sort. Thanks a lot. – ytoamn Jul 05 '17 at 15:06
  • @ytoamn - I updated my answer to include a hybrid msb / lsb example. – rcgldr Jul 06 '17 at 02:55
  • thanks a lot, will try it on my system, meanwhile have a look at EDIT 5, it is something big, really big, feel free to share your views :) – ytoamn Jul 06 '17 at 20:04
  • @ytoamn - I deleted my prior comments (no longer needed). My examples were for unsigned integers, which Java doesn't have as primitives. The signed version in EDIT 5 looks OK. – rcgldr Jul 06 '17 at 20:29
  • Hi @rcgldr, I was wondering why you use a matrix of counts when 4 arrays of 256 length are quite faster. All my simulations and tests show that 4 separate arrays are roughly 20% faster than matrix of counts. I think the reason is the address calculation is done by row-major order formula which can be slower than address calculation in 4 separate arrays. Please enlighten me if I'm going wrong somewhere. – ytoamn Jul 08 '17 at 21:29
  • @ytoamn - As commented above, using separate arrays was only 2% faster with my testing, larger arrays (16+ million 32 bit integers) much greater than the 8MB L3 cache size on my system, so the key issue is the random order writes that don't fit in cache. For an array of 1 million 32 bit integers, both original and temp arrays fit in cache, and separate arrays is about 8.5% faster, from 0.0128 seconds to 0.0118 seconds, but this is about 1/80th of a second (0.0125), where I wasn't concerned about performance. I used a matrix to keep the examples simple. – rcgldr Jul 09 '17 at 17:00
  • @ytoamn - The hybrid msb/lsb radix sort produced the most improvement. For array size of 1 million 32 bit integers, it took 0.0102 seconds (versus 0.0118 for separate arrays or 0.0129 for matrix). This is probably because the 256 pairs of bins mostly fit in L1 and L2 cache. – rcgldr Jul 09 '17 at 20:53
  • It seems surprising, the hybrid msb-lsb took 0.150005 seconds on average to sort 2mil numbers, but my pointer optimized version took 0.038876 seconds on average, so it might be something else other than code which is responsible for these variations, I appreciate your help a lot, thanks +1, feel free to reach out anytime. – ytoamn Jul 10 '17 at 17:01
  • @ytoamn - the timing is sensitive to the data. The hybrid msb-lsb requires near uniform data (for the most significant byte) in order to be faster. I'm using VS rand() function 4 times, one for each byte of a 32 bit integer, for the random data (this was based on legacy testing of older sort programs). Also I'm testing with Windows / C++. I could port the code to java, which will be slower, but relative performance should be similar. On my system with VS and C++, for 2 million integers, it takes 0.0320 seconds with separate arrays, and 0.0236 seconds with hybrid msb-lsb. – rcgldr Jul 10 '17 at 17:55
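For illustration (not code from the answer or the comments): since Java has no unsigned primitives, the +128 bias mentioned in the question comments can be applied to the most significant byte, mapping top bytes -128..127 onto bucket indices 0..255, so a plain 8,8,8,8 lsb radix sort handles signed 32-bit values without a separate negative-number pass. A minimal Java sketch with a hypothetical method name follows.

// Hedged sketch (hypothetical method, not from the answer): an 8,8,8,8 lsb
// radix sort for signed 32-bit ints. The only sign-specific change is the
// +128 bias on the most significant byte, which maps top bytes -128..127
// onto bucket indices 0..255 so that negative values come out first.
static void radixSortSigned(int A[])
{
    int mIndex[][] = new int[4][256];       // count / index matrix
    int tmp[] = new int[A.length];

    for (int i = 0; i < A.length; i++) {    // generate all histograms in one read
        mIndex[0][A[i] & 255]++;
        mIndex[1][(A[i] >> 8) & 255]++;
        mIndex[2][(A[i] >> 16) & 255]++;
        mIndex[3][((A[i] >> 24) + 128) & 255]++;    // biased top byte
    }
    for (int j = 0; j < 4; j++) {           // convert counts to start indices
        int m = 0;
        for (int i = 0; i < 256; i++) {
            int n = mIndex[j][i];
            mIndex[j][i] = m;
            m += n;
        }
    }
    int[] src = A, dst = tmp;
    for (int j = 0; j < 4; j++) {           // 4 stable passes, lsb to msb
        for (int i = 0; i < src.length; i++) {
            int digit = (j < 3) ? (src[i] >> (j << 3)) & 255
                                : ((src[i] >> 24) + 128) & 255;
            dst[mIndex[j][digit]++] = src[i];
        }
        int[] swap = src; src = dst; dst = swap;
    }
    // after an even number of passes, the sorted data is back in A
}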

N & 15, N & 31, N & 63 ... and so on, which of these bitwise operations takes least time?

They are the same. Do not take it badly, but optimizing for speed without knowing how long things take may end up quite badly. And even when you know the timings, hardware is very complicated nowadays and quite unpredictable. You program in Java, which is another layer of an insanely complex system. The same code may be faster today and slower tomorrow. You say it is approximately 2.232891909840167 times faster; in reality, you have a measurement on one hardware and software configuration with one set of data, and you can only hope the measurement is representative enough. Unfortunately, that is not always the case.

I rewrote your function. It is shorter and simpler, yet does not seem to be slower. Compilers tend to like code that is not too clever, as there are many optimizations for simple cases. The correction for negative numbers is not particularly nice; you can delete it if you do not like it. It seems to work best with 8 bits and 11 bits, probably due to cache sizes; have a look at the comments by rcgldr.

EDIT

@ytoamn you are right, if everything is in the first bucket the loop should continue, not break. That was a bug. As for the other changes, I would rather avoid the contract you have now. I think there are three natural contracts for a sorting function: the first sorts the original array and returns null; the second sorts the original array and returns it; the third returns a new sorted array and keeps the original array intact. I like the first one, as its behaviour is unambiguous. The way you have it now, you would have to add a big warning to the documentation that the original array is changed, and is returned from the function in some cases and not in others.

The second thing I would avoid is the old C code style. You should define a loop variable in the loop if you need it only there. Defining it at function scope injects a dependency that may lead to bugs, and it has no advantage here, as properly scoped loop variables would share the same space in the end anyway. The compiler is well aware of the scope; you should use the smallest scope you need.

EDIT2

Feel free to comment directly under my post :-) Local variables are just addresses on the stack; you allocate memory when constructing an object, which is not the case here. As for the array, think about this code:

public static void Tst(int[] A) {
    int[] tmp = new int[A.length];
    A[0] = 6;
    A = tmp; // changes what parameter A contains
    A[0] = 7;
}

public static void main(String[] args) {
    int[] A = new int[1];
    A[0] = 5;
    Tst(A);
    System.out.println(A[0]); //prints 6
}

It prints 6. The number 7 is written into the tmp array only; the array A in main is not affected. Below is the rewritten function:

protected static void ASC2(int A[], int bits) {
    int[] origA = A;
    int[] tmp = new int[A.length];
    int[] Z = new int[1 << bits];
    int mask = (1 << bits) - 1;

    for (int shift = 0; shift < 32; shift += bits) {

        if (shift > 0) {
            Arrays.fill(Z, 0);
        }
        for (int i = 0; i < A.length; ++i) {
            Z[(A[i] >> shift) & mask]++;
        }

        if (Z[0] == A.length) {
            continue; // all in first bucket
        }

        Z[Z.length - 1] = A.length - Z[Z.length - 1];
        for (int i = Z.length - 2; i >= 0; --i) {
            Z[i] = Z[i + 1] - Z[i];
        }

        if (shift + bits > 31) { // negative numbers correction
            int halfLength = Z.length / 2;
            int positSum = Z[halfLength];
            int negSum = A.length - positSum;
            if (negSum > 0) {
                for (int i = 0; i < halfLength; ++i) {
                    Z[i] += negSum;
                }
                for (int i = halfLength; i < Z.length; ++i) {
                    Z[i] -= positSum;
                }
            }
        }

        for (int i = 0; i < A.length; ++i) {
            tmp[Z[(A[i] >> shift) & mask]++] = A[i];
        }

        int[] swap = A;
        A = tmp;
        tmp = swap;
    }

    if (A != origA) {
        System.arraycopy(A, 0, origA, 0, A.length);
    }
}
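For example (an illustrative call, not part of the answer), the base is chosen through the bits parameter; this assumes the method is in the same class and that java.util.Arrays is imported.

public static void main(String[] args) {
    int[] data = { 170, -3, 42, -90000, 7 };
    ASC2(data, 8);       // base 256: four passes over 8-bit digits
    // ASC2(data, 11);   // base 2048: three passes (an 11/11/10 split), as discussed in the comments
    System.out.println(Arrays.toString(data));   // [-90000, -3, 7, 42, 170]
}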

EDIT3

Loop unrolling is a valid technique, and improving the short-circuiting is really nice. But using array lengths as constants is where you definitely start to be too clever. If you hard-coded the base size, why not hard-code it all, like this:

protected static int[] DSC2(int A[])// sorts in descending order
{
    int tmp[] = new int[A.length];
    int Z[] = new int[256];
    int sample, swap[];

    // 1st LSB byte extraction
    sample = A[0] & 255;
    for (int i = 0; i < A.length; ++i) {
        Z[A[i] & 255]++;
    }

    if (Z[sample] != A.length) {
        Z[0] = A.length - Z[0];
        for (int i = 1; i < Z.length; ++i) {
            Z[i] = Z[i - 1] - Z[i];
        }

        for (int i = 0; i < A.length; ++i) {
            tmp[Z[A[i] & 255]++] = A[i];
        }

        swap = A;
        A = tmp;
        tmp = swap;
        Arrays.fill(Z, 0);
    } else {
        Z[sample] = 0;
    }

    // 2nd LSB byte extraction
    sample = (A[0] >> 8) & 255;
    for (int i = 0; i < A.length; ++i) {
        Z[(A[i] >> 8) & 255]++;
    }

    if (Z[sample] != A.length) {
        Z[0] = A.length - Z[0];
        for (int i = 1; i < Z.length; ++i) {
            Z[i] = Z[i - 1] - Z[i];
        }

        for (int i = 0; i < A.length; ++i) {
            tmp[Z[(A[i] >> 8) & 255]++] = A[i];
        }

        swap = A;
        A = tmp;
        tmp = swap;
        Arrays.fill(Z, 0);
    } else {
        Z[sample] = 0;
    }

    // 3rd LSB byte extraction
    sample = (A[0] >> 16) & 255;
    for (int i = 0; i < A.length; ++i) {
        Z[(A[i] >> 16) & 255]++;
    }

    if (Z[sample] != A.length) {
        Z[0] = A.length - Z[0];
        for (int i = 1; i < Z.length; ++i) {
            Z[i] = Z[i - 1] - Z[i];
        }

        for (int i = 0; i < A.length; ++i) {
            tmp[Z[(A[i] >> 16) & 255]++] = A[i];
        }

        swap = A;
        A = tmp;
        tmp = swap;
        Arrays.fill(Z, 0);
    } else {
        Z[sample] = 0;
    }

    // 4th LSB byte extraction
    sample = (A[0] >> 24) & 255;
    for (int i = 0; i < A.length; ++i) {
        Z[(A[i] >> 24) & 255]++;
    }

    if (Z[sample] != A.length) {
        Z[0] = A.length - Z[0];
        for (int i = 1; i < Z.length; ++i) {
            Z[i] = Z[i - 1] - Z[i];
        }

        for (int i = 0; i < A.length; ++i) {
            tmp[Z[(A[i] >> 24) & 255]++] = A[i];
        }

        A = tmp;
    }

    return A;
}
Antonín Lejsek
  • I made changes to my main code to reflect the latest developments, feel free to share your views on it. – ytoamn May 02 '17 at 05:51
  • Thats nice, I also did that and ran simulations, but, there was no clear winner among these 2, your hardcoded version seemed great with arrays at small and medim lengths, but then it lost to the mine once array sizes were beyond 1000, however the difference between the 2 variants was only few microseconds. Results might come out different for someone else. I would also like to ask that why you have not used a variable for A.length too, is array length calculation faster than a variable access time. The results said so, performance dropped when A.length was replaced by N, so also Z.length. – ytoamn May 03 '17 at 09:56
  • What have you done has truly been a great help to me, i got a code that runs 10 times fast, even beating the famous comparison sorts, really I didnot believe earlier it was possible. I will try out the other suggestions mentioned here, and then might multi-thread the best one if its possible, but right now I think I got what I was looking for. Thanks a lot – ytoamn May 03 '17 at 10:03
  • @ytoamn I think I am done here. It would be nice if you could paste your fastest version as an answer and mark it as accepted (you can accept your own aswer). If you get into trouble with the multithreaded version, you can ask another question and we may meet again :-) – Antonín Lejsek May 06 '17 at 00:47
  • Thanks, I will do as said, still trying to figure out the histogram way, I just wanted to know why using Array.lengths in for loops in condition statement, speeded up the code, while using constants slowed it. e.g. the code uses Z.length for frequency calculation, using 256 performance drops. Same with A.length. – ytoamn May 06 '17 at 05:58
  • @ytoamn Like in `for (int i = 0; i < A.length; ++i)`? There should be an optimization, that skips range check for `A[i]`, as it is guaranteed to be within the array. – Antonín Lejsek May 06 '17 at 12:59
  • Thanks, that explains what was going on then. – ytoamn May 06 '17 at 19:46