
In various contexts, such as bioinformatics, computations on byte-sized integers are sufficient. For best performance, many processor architectures offer SIMD instruction sets (e.g. MMX, SSE, AVX) that partition registers into byte-, halfword-, and word-sized components, then perform arithmetic, logical, and shift operations individually on corresponding components.

However, some architectures do not offer such SIMD instructions, so they must be emulated, which often requires a significant amount of bit-twiddling. At the moment, I am looking at SIMD comparisons, and in particular the parallel comparison of signed, byte-sized integers. I have a solution that I think is quite efficient using portable C code (see the function vsetles4 below). It is based on an observation made in 2000 by Peter Montgomery in a newsgroup posting, that (A+B)/2 = (A AND B) + (A XOR B)/2 holds without overflow in the intermediate computation.
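
For concreteness, here is a minimal scalar sketch of that identity (my own illustration, not taken from Montgomery's posting): since A + B = (A XOR B) + 2*(A AND B) in two's complement, the halved sum can be formed without ever materializing A + B. ISO C leaves right-shifting a negative value implementation-defined; this sketch assumes an arithmetic shift, whereas the byte-wise emulation below avoids the issue by synthesizing the arithmetic shift from unsigned operations.

#include <stdint.h>

/* floor((x + y) / 2) without overflow in the intermediate computation;
   assumes '>>' on a negative signed operand is an arithmetic shift */
static inline int32_t avg_floor (int32_t x, int32_t y)
{
    return (x & y) + ((x ^ y) >> 1);
}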

**Can this particular emulation code (function vsetles4) be accelerated further?** To first order, any solution with a lower count of basic operations would qualify. I am looking for solutions in portable ISO C99, without the use of machine-specific intrinsics. Most architectures support ANDN (a & ~b), so this may be assumed to be available as a single operation in terms of efficiency.

#include <stdint.h>

/*
   vsetles4 treats its inputs as arrays of bytes, each of which holds
   a signed integer in [-128,127]. Compute, in byte-wise fashion between
   corresponding bytes of 'a' and 'b', the boolean predicate "less than
   or equal" as a value in [0,1] in the corresponding byte of the result.
*/

/* reference implementation */
uint32_t vsetles4_ref (uint32_t a, uint32_t b)
{
    uint8_t a0 = (uint8_t)((a >>  0) & 0xff);
    uint8_t a1 = (uint8_t)((a >>  8) & 0xff);
    uint8_t a2 = (uint8_t)((a >> 16) & 0xff);
    uint8_t a3 = (uint8_t)((a >> 24) & 0xff);
    uint8_t b0 = (uint8_t)((b >>  0) & 0xff);
    uint8_t b1 = (uint8_t)((b >>  8) & 0xff);
    uint8_t b2 = (uint8_t)((b >> 16) & 0xff);
    uint8_t b3 = (uint8_t)((b >> 24) & 0xff);
    int p0 = (int32_t)(int8_t)a0 <= (int32_t)(int8_t)b0;
    int p1 = (int32_t)(int8_t)a1 <= (int32_t)(int8_t)b1;
    int p2 = (int32_t)(int8_t)a2 <= (int32_t)(int8_t)b2;
    int p3 = (int32_t)(int8_t)a3 <= (int32_t)(int8_t)b3;

    return (((uint32_t)p3 << 24) | ((uint32_t)p2 << 16) |
            ((uint32_t)p1 <<  8) | ((uint32_t)p0 <<  0));
}

/* Optimized implementation:

   a <= b; a - b <= 0;  a + ~b + 1 <= 0; a + ~b < 0; (a + ~b)/2 < 0.
   Compute avg(a,~b) without overflow, rounding towards -INF; then
   lteq(a,b) = sign bit of result. In other words: compute 'lteq' as 
   (a & ~b) + arithmetic_right_shift (a ^ ~b, 1) giving the desired 
   predicate in the MSB of each byte.
*/
uint32_t vsetles4 (uint32_t a, uint32_t b)
{
    uint32_t m, s, t, nb;
    nb = ~b;            // ~b
    s = a & nb;         // a & ~b
    t = a ^ nb;         // a ^ ~b
    m = t & 0xfefefefe; // don't cross byte boundaries during shift
    m = m >> 1;         // logical portion of arithmetic right shift
    s = s + m;          // start (a & ~b) + arithmetic_right_shift (a ^ ~b, 1)
    s = s ^ t;          // complete arithmetic right shift and addition
    s = s & 0x80808080; // MSB of each byte now contains predicate
    t = s >> 7;         // result is byte-wise predicate in [0,1]
    return t;
}
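
Since each byte lane is computed independently, vsetles4 can be verified exhaustively against vsetles4_ref by replicating every pair of signed byte values across all four lanes. The following harness is just a sketch I use for checking (the function name test_vsetles4 and the lane-replication trick are mine, not part of the code above):

#include <stdio.h>

/* exhaustive per-lane self-test of vsetles4 against vsetles4_ref */
int test_vsetles4 (void)
{
    for (int x = 0; x < 256; x++) {
        for (int y = 0; y < 256; y++) {
            uint32_t a = (uint32_t)x * 0x01010101u;  /* byte x in all four lanes */
            uint32_t b = (uint32_t)y * 0x01010101u;  /* byte y in all four lanes */
            if (vsetles4 (a, b) != vsetles4_ref (a, b)) {
                printf ("mismatch: a=%08x b=%08x\n", (unsigned)a, (unsigned)b);
                return 1;
            }
        }
    }
    return 0;   /* all 256 * 256 byte pairs agree */
}
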
njuffa
  • And your question is? – rici Feb 02 '17 at 23:28
  • @rici Are there faster solutions available? I will edit so the question actually contains a question mark. – njuffa Feb 02 '17 at 23:38
  • Have you actually timed this to be any faster than going through an array of `int8_t`s and applying `<=`? (Not timed relative to `vsetles4_ref` - timed relative to not trying to pack these things into `uint32_t`s at all.) – user2357112 Feb 02 '17 at 23:52
  • @user2357112 That is a valid point, but no, I haven't tried. That is not particularly convenient in context: the data is already stored packed and aligned for best memory performance, not as an array of `int8_t`, and processed using SIMD intrinsics on platforms that support them. Carrying each byte separately in a register would cause a significant increase in register pressure during computations. Note that the 32-bit operations used here would morph to 64-bit operations on 64-bit platforms, further amortizing overhead from the "packed" approach. – njuffa Feb 03 '17 at 00:28
  • If this code works and you just want suggestions on speeding it up, I'd suggest CodeReview.SE. – Nic Feb 03 '17 at 01:15
  • this is a super fun question. – Matt Timmermans Feb 03 '17 at 01:20
  • @QPaysTaxes I was "showing my work" here (lest somebody think I am trying to pawn off work to the SO users), not looking for a review of my code ("you shouldn't use one-letter variable names" :-). I tackled this problem before and got down to 11 or 12 instructions some years ago, then I hit upon the improved code in this question as I was answering some other [question](http://stackoverflow.com/questions/41965494/carry-bits-in-incidents-of-overflow) the other day. – njuffa Feb 03 '17 at 01:25
  • @AldwinCheung I originally looked into this for NVIDIA GPUs. The Kepler architecture has this operation (mostly) in hardware; following GPU architectures do not and need emulation. While at NVIDIA I created a largish set of such primitives ([here](https://nvlabs.github.io/nvbio/vs_2sse-test_2simd__functions_8h_source.html), BSD license). I have since retired, but a [recent question on SO](http://stackoverflow.com/questions/41965494/carry-bits-in-incidents-of-overflow) prompted me to re-visit this emulation, which is interesting independent of architecture (may also be useful for low-end ARM). – njuffa Feb 03 '17 at 06:09
  • Since the question was put on hold as "unclear", I have rephrased and bolded the question. If it is still unclear, I would appreciate a comment as to why it is considered unclear; it is a straightforward algorithm-tuning question in my thinking. – njuffa Feb 03 '17 at 16:25
  • So ask for a review with a focus on speed. CodeReview.SE isn't a bureaucracy; they'll happily focus on reducing the number of instructions this takes from [very low number] to [ridiculously low number]. – Nic Feb 03 '17 at 18:32
  • @QPaysTaxes As far as I can discern, optimization questions are not off-topic here (a site I am very familiar with), so there should be no need to migrate to CodeReview.SE (a site I am not familiar with at all). Besides, the question was flagged as "unclear", not "off-topic"; it is not clear to me why. I will leave things the way they are; this is part of my hobby and I have no pressing need to find a better solution, I am simply curious about potentially superior ways to code this. – njuffa Feb 03 '17 at 19:06
  • Er, no, "how do I make this code better" is not on-topic here. That falls under "too broad" -- after all, SO is for specific questions. Code Review, which I am familiar with, is the best place for this question. – Nic Feb 03 '17 at 19:07
  • @QPaysTaxes I'm also familiar with Code Review, and this question happens to be on a fine line in between. It's asking for improvement, but has a specific question. SO doesn't deal well with optimization and CR doesn't do well with specific questions. I can understand why the author decided to post it here and it's unfortunate the question got closed. – Mast Feb 03 '17 at 19:10
  • I am probably expressing myself poorly, because I am not asking "how can I make this code better" but "how can this functionality be achieved with fewer basic operations", which seems very specific to me. "Fewer basic operations" is sufficient for a speedup on the simple in-order processors I am interested in. Suggestions on how to rephrase the question to make it clear are welcome. – njuffa Feb 03 '17 at 19:31

1 Answer


To [possibly] save you some work and to answer user2357112's question, I've created a [crude] benchmark for this. I added a byte-at-a-time version as a base reference:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

long opt_R;
long opt_N;

void *aptr;
void *bptr;
void *cptr;

/*
   vsetles4 treats its inputs as arrays of bytes, each of which holds
   a signed integer in [-128,127]. Compute, in byte-wise fashion between
   corresponding bytes of 'a' and 'b', the boolean predicate "less than
   or equal" as a value in [0,1] in the corresponding byte of the result.
*/

/* base implementation */
void
vsetles4_base(const void *va, const void *vb, long count, void *vc)
{
    const int8_t *aptr;     /* int8_t, not plain char: char may be unsigned */
    const int8_t *bptr;
    int8_t *cptr;
    long idx;

    count *= 4;
    aptr = va;
    bptr = vb;
    cptr = vc;

    for (idx = 0;  idx < count;  ++idx)
        cptr[idx] = (aptr[idx] <= bptr[idx]);
}

/* reference implementation */
static inline uint32_t
_vsetles4_ref(uint32_t a, uint32_t b)
{
    uint8_t a0 = (uint8_t)((a >>  0) & 0xff);
    uint8_t a1 = (uint8_t)((a >>  8) & 0xff);
    uint8_t a2 = (uint8_t)((a >> 16) & 0xff);
    uint8_t a3 = (uint8_t)((a >> 24) & 0xff);
    uint8_t b0 = (uint8_t)((b >>  0) & 0xff);
    uint8_t b1 = (uint8_t)((b >>  8) & 0xff);
    uint8_t b2 = (uint8_t)((b >> 16) & 0xff);
    uint8_t b3 = (uint8_t)((b >> 24) & 0xff);

    int p0 = (int32_t)(int8_t)a0 <= (int32_t)(int8_t)b0;
    int p1 = (int32_t)(int8_t)a1 <= (int32_t)(int8_t)b1;
    int p2 = (int32_t)(int8_t)a2 <= (int32_t)(int8_t)b2;
    int p3 = (int32_t)(int8_t)a3 <= (int32_t)(int8_t)b3;

    return (((uint32_t)p3 << 24) | ((uint32_t)p2 << 16) |
            ((uint32_t)p1 <<  8) | ((uint32_t)p0 <<  0));
}

uint32_t
vsetles4_ref(uint32_t a, uint32_t b)
{

    return _vsetles4_ref(a,b);
}

/* Optimized implementation:
   a <= b; a - b <= 0;  a + ~b + 1 <= 0; a + ~b < 0; (a + ~b)/2 < 0.
   Compute avg(a,~b) without overflow, rounding towards -INF; then
   lteq(a,b) = sign bit of result. In other words: compute 'lteq' as
   (a & ~b) + arithmetic_right_shift (a ^ ~b, 1) giving the desired
   predicate in the MSB of each byte.
*/
static inline uint32_t
_vsetles4(uint32_t a, uint32_t b)
{
    uint32_t m, s, t, nb;
    nb = ~b;            // ~b
    s = a & nb;         // a & ~b
    t = a ^ nb;         // a ^ ~b
    m = t & 0xfefefefe; // don't cross byte boundaries during shift
    m = m >> 1;         // logical portion of arithmetic right shift
    s = s + m;          // start (a & ~b) + arithmetic_right_shift (a ^ ~b, 1)
    s = s ^ t;          // complete arithmetic right shift and addition
    s = s & 0x80808080; // MSB of each byte now contains predicate
    t = s >> 7;         // result is byte-wise predicate in [0,1]
    return t;
}

uint32_t
vsetles4(uint32_t a, uint32_t b)
{

    return _vsetles4(a,b);
}

/* Optimized implementation:
   a <= b; a - b <= 0;  a + ~b + 1 <= 0; a + ~b < 0; (a + ~b)/2 < 0.
   Compute avg(a,~b) without overflow, rounding towards -INF; then
   lteq(a,b) = sign bit of result. In other words: compute 'lteq' as
   (a & ~b) + arithmetic_right_shift (a ^ ~b, 1) giving the desired
   predicate in the MSB of each byte.
*/
static inline uint64_t
_vsetles8(uint64_t a, uint64_t b)
{
    uint64_t m, s, t, nb;
    nb = ~b;            // ~b
    s = a & nb;         // a & ~b
    t = a ^ nb;         // a ^ ~b
    m = t & 0xfefefefefefefefeULL; // don't cross byte boundaries during shift
    m = m >> 1;         // logical portion of arithmetic right shift
    s = s + m;          // start (a & ~b) + arithmetic_right_shift (a ^ ~b, 1)
    s = s ^ t;          // complete arithmetic right shift and addition
    s = s & 0x8080808080808080ULL; // MSB of each byte now contains predicate
    t = s >> 7;         // result is byte-wise predicate in [0,1]
    return t;
}

uint64_t
vsetles8(uint64_t a, uint64_t b)
{

    return _vsetles8(a,b);
}

void
aryref(const void *va,const void *vb,long count,void *vc)
{
    long idx;
    const uint32_t *aptr;
    const uint32_t *bptr;
    uint32_t *cptr;

    aptr = va;
    bptr = vb;
    cptr = vc;

    for (idx = 0;  idx < count;  ++idx)
        cptr[idx] = _vsetles4_ref(aptr[idx],bptr[idx]);
}

void
arybest4(const void *va,const void *vb,long count,void *vc)
{
    long idx;
    const uint32_t *aptr;
    const uint32_t *bptr;
    uint32_t *cptr;

    aptr = va;
    bptr = vb;
    cptr = vc;

    for (idx = 0;  idx < count;  ++idx)
        cptr[idx] = _vsetles4(aptr[idx],bptr[idx]);
}

void
arybest8(const void *va,const void *vb,long count,void *vc)
{
    long idx;
    const uint64_t *aptr;
    const uint64_t *bptr;
    uint64_t *cptr;

    count >>= 1;

    aptr = va;
    bptr = vb;
    cptr = vc;

    for (idx = 0;  idx < count;  ++idx)
        cptr[idx] = _vsetles8(aptr[idx],bptr[idx]);
}

double
tvgetf(void)
{
    struct timespec ts;
    double sec;

    clock_gettime(CLOCK_REALTIME,&ts);
    sec = ts.tv_nsec;
    sec /= 1e9;
    sec += ts.tv_sec;

    return sec;
}

void
timeit(void (*fnc)(const void *,const void *,long,void *),const char *sym)
{
    double tvbeg;
    double tvend;

    tvbeg = tvgetf();
    fnc(aptr,bptr,opt_N,cptr);
    tvend = tvgetf();

    printf("timeit: %.9f %s\n",tvend - tvbeg,sym);
}

// fill -- fill array with random numbers
void
fill(void *vptr)
{
    uint32_t *iptr = vptr;

    for (long idx = 0;  idx < opt_N;  ++idx)
        iptr[idx] = rand();
}

// main -- main program
int
main(int argc,char **argv)
{
    char *cp;

    --argc;
    ++argv;

    for (;  argc > 0;  --argc, ++argv) {
        cp = *argv;
        if (*cp != '-')
            break;

        switch (cp[1]) {
        case 'R':
            opt_R = strtol(cp + 2,&cp,10);
            break;

        case 'N':
            opt_N = strtol(cp + 2,&cp,10);
            break;

        default:
            break;
        }
    }

    if (opt_R == 0)
        opt_R = 1;
    srand(opt_R);
    printf("R=%ld\n",opt_R);

    if (opt_N == 0)
        opt_N = 100000000;
    printf("N=%ld\n",opt_N);

    aptr = calloc(opt_N,sizeof(uint32_t));
    bptr = calloc(opt_N,sizeof(uint32_t));
    cptr = calloc(opt_N,sizeof(uint32_t));

    fill(aptr);
    fill(bptr);

    timeit(vsetles4_base,"base");
    timeit(aryref,"aryref");
    timeit(arybest4,"arybest4");
    timeit(arybest8,"arybest8");
    timeit(vsetles4_base,"base");

    return 0;
}

Here is the output of a run:

R=1
N=100000000
timeit: 0.550527096 base
timeit: 0.483014107 aryref
timeit: 0.236460924 arybest4
timeit: 0.147254944 arybest8
timeit: 0.440311432 base

Note that your reference implementation is not much faster than the byte-at-a-time version [hardly worth the complexity, IMO].

Your optimized algorithm does provide the best performance, short of SIMD, and I extended it to use uint64_t, which [naturally] doubles the speed yet again.

It might be interesting for you to benchmark the SIMD versions, too, just to prove that they really are the fastest.
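
For what it is worth, on x86 with SSE2 such a SIMD version could be sketched roughly as follows (my own illustration using the standard emmintrin.h intrinsics; the function name and the assumption that count is a multiple of four 32-bit words are mine). There is no byte-wise "less than or equal" compare, so it is formed as the complement of the signed "greater than" compare:

#include <emmintrin.h>      /* SSE2 intrinsics */

/* illustrative only: processes 16 byte lanes per iteration;
   'count' is the number of 32-bit words, assumed to be a multiple of 4 */
void
vsetles16_sse2(const void *va, const void *vb, long count, void *vc)
{
    const __m128i *ap = (const __m128i *)va;
    const __m128i *bp = (const __m128i *)vb;
    __m128i *cp = (__m128i *)vc;
    __m128i ones = _mm_set1_epi8(1);

    for (long idx = 0;  idx < count / 4;  ++idx) {
        __m128i a  = _mm_loadu_si128(&ap[idx]);
        __m128i b  = _mm_loadu_si128(&bp[idx]);
        __m128i gt = _mm_cmpgt_epi8(a, b);        // 0xff where a > b
        __m128i le = _mm_andnot_si128(gt, ones);  // 0x01 where a <= b
        _mm_storeu_si128(&cp[idx], le);
    }
}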

Craig Estey
  • Thanks for putting in all that work. I do not currently have access to the hardware platform that has the hardware SIMD support (a particular GPU architecture), but based on my knowledge of the architecture, that hardware should be about twice the speed of my `vsetles4` emulation code here (throughput of SIMD instructions is 1/4 of regular ALU instructions, execution is in-order). My original code (BSD license) can be found [here](https://nvlabs.github.io/nvbio/vs_2sse-test_2simd__functions_8h_source.html). The interface is fixed (this is shipping software) and exactly as shown here. – njuffa Feb 03 '17 at 06:02