13

In C, is there a branchless technique to compute the absolute difference between two unsigned ints? For example, given variables a and b, I would like the value 2 both when a=3, b=5 and when a=5, b=3. Ideally I would also like to be able to vectorize the computation using the SSE registers.

Jack Nock

8 Answers

8

There are several ways to do it; I'll just mention a couple:

SSE4

  • Use PMINUD and PMAXUD to separate the larger values into register #1 and the smaller values into register #2.
  • Subtract them (a sketch follows this list).
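A minimal sketch of that recipe with intrinsics (untested; the function name is mine, and it assumes SSE4.1):

#include <smmintrin.h>  // SSE4.1

// |a - b| for four packed unsigned 32-bit ints
static inline __m128i absdiff_epu32_sse41( __m128i a, __m128i b )
{
    __m128i mx = _mm_max_epu32( a, b );  // PMAXUD: the larger of each pair
    __m128i mn = _mm_min_epu32( a, b );  // PMINUD: the smaller of each pair
    return _mm_sub_epi32( mx, mn );      // difference is always non-negative
}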

MMX/SSE2

  • Flip the sign bit of both values, because the next instruction only performs signed integer comparison.
  • PCMPGTD. Use the result as a mask.
  • Compute both (a-b) and (b-a).
  • Use POR ( PAND ( mask, a-b ), PANDN ( mask, b-a ) ) to select the correct value for the absolute difference (see the sketch after this list).
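A sketch of that SSE2 recipe (untested; the naming and the <limits.h> constant are mine):

#include <emmintrin.h>  // SSE2
#include <limits.h>     // INT_MIN

// |a - b| for four packed unsigned 32-bit ints
static inline __m128i absdiff_epu32_sse2( __m128i a, __m128i b )
{
    __m128i sign = _mm_set1_epi32( INT_MIN );   // 0x80000000 in each lane
    __m128i as   = _mm_xor_si128( a, sign );    // flip sign bits so the signed
    __m128i bs   = _mm_xor_si128( b, sign );    // compare behaves as unsigned
    __m128i mask = _mm_cmpgt_epi32( as, bs );   // PCMPGTD: all-ones where a > b
    __m128i amb  = _mm_sub_epi32( a, b );       // a - b
    __m128i bma  = _mm_sub_epi32( b, a );       // b - a
    return _mm_or_si128( _mm_and_si128( mask, amb ),      // take a-b where a > b
                         _mm_andnot_si128( mask, bma ) ); // take b-a elsewhere
}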
rwong
5

From tommesani.com, one solution for this problem is to use saturating unsigned subtraction twice.

As the saturating subtraction never goes below 0, you compute:

r1 = (a-b).saturating
r2 = (b-a).saturating

If a is greater than b, r1 will contain the answer and r2 will be 0, and vice versa for b>a. ORing the two partial results together yields the desired result.

According to the VTune users manual, PSUBUSB/PSUBUSW are available for 128-bit registers, so you should be able to get a lot of parallelism this way.
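A sketch of that with intrinsics, for eight packed 16-bit elements (untested; unsigned saturating subtraction only exists for byte and word elements, per the comment below):

#include <emmintrin.h>  // SSE2

// |a - b| for eight packed unsigned 16-bit ints
static inline __m128i absdiff_epu16( __m128i a, __m128i b )
{
    __m128i r1 = _mm_subs_epu16( a, b );  // PSUBUSW: a-b, clamped at 0
    __m128i r2 = _mm_subs_epu16( b, a );  // PSUBUSW: b-a, clamped at 0
    return _mm_or_si128( r1, r2 );        // POR: one operand is always 0
}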

Justin W
  • sub with unsigned saturation appears to only be available for words or bytes, but luckily that's what I was looking for. This answer is very slightly better than the top-voted answer using SSE4.1 PMINU/PMAXU/PSUB, because `POR` can run on more ports than `PSUB` on some CPUs (Intel Haswell for example). – Peter Cordes Dec 24 '15 at 10:43
4
max(i,j) - min(i,j)
(i>j)*(i-j) + (j>i)*(j-i)

You can certainly use SSE registers, but the compiler may well do this for you anyway.
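Both forms as compilable C, for reference (a sketch; the function names are mine):

unsigned udiff_minmax( unsigned i, unsigned j )
{
    unsigned mx = i > j ? i : j;  // max(i,j)
    unsigned mn = i > j ? j : i;  // min(i,j)
    return mx - mn;
}

unsigned udiff_mul( unsigned i, unsigned j )
{
    return (i > j)*(i - j) + (j > i)*(j - i);  // one term is always 0
}

With optimizations on, compilers often turn the first form into branchless min/max (conditional moves) or vectorize the loop around it.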

Anycorn
3

SSE2:

Seems to be about the same speed as Phernost's second function. Sometimes GCC schedules it to be a full cycle faster, other times a little slower.

__m128i big = _mm_set_epi32( INT_MIN, INT_MIN, INT_MIN, INT_MIN );

a = _mm_add_epi32( a, big ); // re-center the variables: send 0 to INT_MIN,
b = _mm_add_epi32( b, big ); // INT_MAX to -1, etc.
__m128i diff = _mm_sub_epi32( a, b ); // get signed difference
__m128i mask = _mm_cmpgt_epi32( b, a ); // mask: need to negate difference?
mask = _mm_andnot_si128( big, mask ); // mask = 0x7ffff... if negating
diff = _mm_xor_si128( diff, mask ); // 1's complement except MSB
diff = _mm_sub_epi32( diff, mask ); // add 1 and restore MSB

SSSE3:

Ever so slightly faster than the previous version. There is a lot of variation depending on how things outside the loop are declared. (For example, making a and b volatile makes things faster! It appears to be a random effect on scheduling.) But this is consistently the fastest, by a cycle or so.

__m128i big = _mm_set_epi32( INT_MIN, INT_MIN, INT_MIN, INT_MIN );

a = _mm_add_epi32( a, big ); // re-center the variables: send 0 to INT_MIN,
b = _mm_add_epi32( b, big ); // INT_MAX to -1, etc.
__m128i diff = _mm_sub_epi32( b, a ); // get reverse signed difference
__m128i mask = _mm_cmpgt_epi32( b, a ); // mask: need to negate difference?
mask = _mm_xor_si128( mask, big ); // mask cannot be 0 for PSIGND insn
diff = _mm_sign_epi32( diff, mask ); // negate diff if needed

SSE4 (thx rwong):

Can't test this.

__m128i diff = _mm_sub_epi32( _mm_max_epu32( a, b ), _mm_min_epu32( a, b ) );
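For context, a sketch of applying that over arrays (my naming; assumes n is a multiple of 4, with the tail handled elsewhere):

#include <smmintrin.h>  // SSE4.1
#include <stddef.h>     // size_t

void absdiff_epu32_arrays( const unsigned *pa, const unsigned *pb,
                           unsigned *out, size_t n )
{
    for( size_t i = 0; i < n; i += 4 )
    {
        __m128i a = _mm_loadu_si128( (const __m128i *)(pa + i) );
        __m128i b = _mm_loadu_si128( (const __m128i *)(pb + i) );
        __m128i d = _mm_sub_epi32( _mm_max_epu32( a, b ),
                                   _mm_min_epu32( a, b ) );
        _mm_storeu_si128( (__m128i *)(out + i), d );
    }
}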
Potatoswatter
1

One or more of the below will likely result in branchless code, depending on the machine and compiler, since the conditional expressions are all very simple.

I haven't been through all the SSE answers, but possibly some of the below are represented in the vector code; certainly all of the below are vectorizable (if you have an unsigned compare to begin with, or fake one by toggling the MSB first). I thought it would be helpful to have some practical scalar answers to the question.

unsigned udiff( unsigned a, unsigned b )
{
      unsigned result = a-b;   // wraps if a<b; the negation below fixes that
      if( a < b ) result = -result;
      return result;
}
unsigned udiff( unsigned a, unsigned b )
{
      unsigned n = (a<b) ? (unsigned)-1 : 0u;
      unsigned result = a-b;
      return (result^n)-n; // 'result' if n = 0; '-result' if n = 0xFFFFFFFF
}


unsigned udiff( unsigned a, unsigned b )
{
      unsigned axb = a^b;
      if( a < b )  axb = 0;
      return (axb^b) - (axb^a);  // a-b, or b-a
}

This will work on x86_64 (or anything where 64-bit temps are basically free)

unsigned udiff( unsigned a, unsigned b )
{
      unsigned n = (unsigned)(
         (long long)( (unsigned long long)a - (unsigned long long)b ) >> 32
      ); // same n as in the 2nd example: 0 if a>=b, 0xFFFFFFFF if a<b
      unsigned result = a-b;
      return (result^n)-n; // 'result' if n = 0; '-result' if n = 0xFFFFFFFF
}
greggo
1

Compute the difference, then return its absolute value:

__m128i diff = _mm_sub_epi32(a, b);   // wrapped difference a-b
__m128i mask = _mm_xor_si128(diff, a);
mask = _mm_xor_si128(mask, b);        // diff^a^b: intended borrow indicator
mask = _mm_srai_epi32(mask, 31);      // broadcast its sign bit across each lane
diff = _mm_xor_si128(diff, mask);     // conditional one's-complement negate...
mask = _mm_srli_epi32(mask, 31);      // ...plus 1 where negating
diff = _mm_add_epi32(diff, mask);     // = two's-complement negate if mask set

This requires one less operation than using the signed compare op, and produces less register pressure.

Same amount of register pressure as before and two more ops, but better branching and merging of dependency chains, instruction pairing for uop decoding, and separate unit utilization. It does require a load, though, which may miss in cache. I'm out of ideas after this one.

__m128i mask, diff;
diff = _mm_set1_epi32( INT_MIN ); // sign-bit constant; dependency branch after
a = _mm_add_epi32(a, diff); // arithmetic sign flip
b = _mm_xor_si128(b, diff); // bitwise sign flip parallel with 'add' unit
diff = _mm_xor_si128(a, b); // reduce uops, instruction already decoded
mask = _mm_cmpgt_epi32(b, a); // parallel with xor
mask = _mm_and_si128(mask, diff); // dependency merge, branch after
a = _mm_xor_si128(a, mask); // if 2 'bit' units in CPU, parallel with next
b = _mm_xor_si128(b, mask); // reduce uops, instruction already decoded
diff = _mm_sub_epi32(a, b); // result

After timing each version over 2 million iterations on a Core 2 Duo, the differences are immeasurable. So pick whatever is easiest to understand.

Phernost
  • Is `sum` supposed to be `diff`? Bah, now that I've read yours closely it's quite similar to mine. But more clever, nice on using the signed difference as a signed comparison. Comparison with zero is generally lighter-weight than right-shifting, though. – Potatoswatter Aug 20 '10 at 00:08
  • Actually, we both made a mistake: in the first function, a three-input consensus function is needed, not three-way XOR. – Potatoswatter Aug 22 '10 at 01:57
0

Try this (assumes two's complement, which is OK judging by the fact that you're asking for SSE):

int d = a-b;
int ad = ((d >> 30) | 1) * d;

Explanation: the sign bit (bit 31) gets propagated down to bit 1, and the | 1 part ensures that the multiplier is either 1 or -1. Multiplications are fast on modern CPUs.

zvrba
  • But the sign bit of a-b is not an indication that a < b. Consider a=0xF0000000 and b = 1. If it were, you could use abs(a-b). – greggo Mar 21 '14 at 16:43
-1

Erm ... it's pretty easy ...

int diff = abs( a - b );

Easily vectorisable (using SSSE3, since _mm_abs_epi32 is an SSSE3 instruction):

__m128i sseDiff = _mm_abs_epi32( _mm_sub_epi32( a, b ) );

a and b are unsigned integers. Consider a=0 and b=0xffffffff: the correct absolute difference is 0xffffffff, but this solution gives 1.

Goz
  • As the weird edit explained, this is wrong. Another example for 8-bit unsigned integers: For `4 - 255`, the correct absolute difference is 251. But treating it as signed -5 and taking the absolute value gives you 5, which is the same answer you get for 255 - 250. – Peter Cordes Oct 27 '17 at 08:43