1

I have a vector int16_t beta = {1,1,0,0,0,0,0,0}.

I want to implement this equation with AVX2

enter image description here

c[i] = a[i] + (-1)^beta[i] * b[i]

where a, b, c, and beta are all AVX2 vectors of int16_t.


I have figured out that, if I can map 1 to -32768 multiplication operation can be avoided. I mean, flipping the sign of vector b can be done using OR and NEGATE functions of simd intrinsics.

I do know that 1 can be mapped to -32768 using left shift operation, however avx2 doesn't have any bit shift operations1. Is there any way to efficiently map 1 to -32768 with simd?


Editor's footnote 1: _mm256_slli_epi16(x, 15) does in fact exist. But there are other ways to implement the whole formula so the question is interesting after all.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
yadhu
  • 1,253
  • 14
  • 25
  • 1
    I'm not sure if I completely understand your question, but maybe the answer is `_mm256_slli_epi16(x, 15)` ? – wim Aug 17 '18 at 08:46
  • How many bits do you want `beta` to have? 8 or 16? – Max Langhof Aug 17 '18 at 08:47
  • `beta` has 16 bits – yadhu Aug 17 '18 at 08:48
  • 1
    More `_mm256_blendv_epi8` could be of interest to blend between `b[i]` and `-b[i]`. – wim Aug 17 '18 at 08:51
  • @wim That appears to unfortunately operate on byte-sized chunks, not individual bits. – Max Langhof Aug 17 '18 at 08:57
  • _mm256_slli_epi16 instruction solved my problem. In addition to that it has very low latency for integers. thanks @wim – yadhu Aug 17 '18 at 08:58
  • If you didn't find shift intrinsics, you're probably looking in the wrong place. [A search for `shift` on Intel's intrinsics finder](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=4067,602,2978&text=shift&techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2,AVX,AVX2) (with AVX512 filtered out to avoid clutter) finds them all. Or look at an asm reference guide http://felixcloutier.com/x86/index.html (which lists intrinsics.) See https://stackoverflow.com/tags/x86/info and https://stackoverflow.com/tags/sse/info for more links to guides/resources. – Peter Cordes Aug 17 '18 at 09:01
  • @MaxLanghof Yes, that is right, but with an extra `_mm256_cmpeq_epi16` you can expand the byte width to 16 bits. – wim Aug 17 '18 at 09:01
  • 1
    To negate a vector, subtract it from 0. `__m256i neg = _mm256_sub_epi16(_mm256_setzero_si256(), input_vector);` x86 doesn't have a SIMD negate instruction. @wim: you might blend `a[i]-b[i]` with `a[i]+b[i]`. Or not, I think Harold's answer is better. – Peter Cordes Aug 17 '18 at 09:05
  • If `beta` only has 1 useful bit per element, consider storing it more compactly to save memory bandwidth / cache footprint. Use [is there an inverse instruction to the movemask instruction in intel avx2?](https://stackoverflow.com/a/36491672) to turn a packed bitmap back into a SIMD vector. The extra instructions that takes are fast enough that you'll still bottleneck on memory bandwidth unless your data is hot in L1d or maybe L2, for this 3-input / 1-output operation, if you aren't doing anything else with `c` while it's already loaded. – Peter Cordes Aug 17 '18 at 09:12
  • @PeterCordes I did think of storing as packed bits, However I need to access bit by bit for extracting/interleaving so i decided to store as integers, may be I should try to find a way to efficiently perform those operations with packed bits. thanks for the suggestion and awesome editing of question – yadhu Aug 17 '18 at 09:52
  • 1
    On Intel CPUs, [`pext`](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=4067,602,2978,4071&techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2,AVX,AVX2,Other&text=pext_) is efficient, so you can turn a vector of 16-bit compare results into a bitmap using `unsigned bitmap8 = _mm256_movemask_epi8` / `bitmap16 = _pext_u32(bitmap8, 0baaaaaaaa)`. Without BMI2 `pext`, you might just leave your bitmap interleaved with zeros, so you have 32 bits per 16 elements, with the low bit of every pair being garbage (or mask it to `0`, or make it always match the significant bit). – Peter Cordes Aug 17 '18 at 09:58

2 Answers2

4

There is a quick way to conditionally negate, using _mm256_sign_epi16. The mask is not of the right form, but it can be transformed into the right form by adding 0x7FFF to every element, so:

__m256i masks = _mm256_add_epi16(beta, _mm256_set1_epi16(0x7FFF));
__m256i res = _mm256_add_epi16(a, _mm256_sign_epi16(b, masks));
harold
  • 61,398
  • 6
  • 86
  • 164
  • But `masks` is negative or zero here, isn't it? While we need `masks` to be positive or negative? Nevertheless, `_mm256_sign_epi16` is a great idea in this context. – wim Aug 17 '18 at 09:18
  • @wim: masks is 0 or 1. So we get either `0x7FFF` (positive) or `0x8000` (negative). If `masks` came from a `cmpeq` or something and was 0 or all-ones (`-1` = `0xffff`), you'd want to add `0x8000` to get `0x7fff` or `0x8000`. Or subtract `masks` *from* that to get the opposite mapping. – Peter Cordes Aug 17 '18 at 09:21
  • @PeterCordes Yes, right, I overlooked that. Great answer! – wim Aug 17 '18 at 09:26
0

You can transform

c[i] = a[i] + (-1)^beta[i] * b[i]

to

c[i] = a[i] - (beta[i] AND b[i]) + ((NOT beta[i]) AND b[i])

as the original formula translates to "subtract this bit if beta[i] is set, otherwise add it".
(I have no idea what you intend to happen for c[i] = 0 + (-1) * 1 or c[i] = 1 + 1 * 1 - I'm assuming normal arithmetic addition with carries here, contrary to the index notation).
So you can just drop the indices:

c = a - (beta & b) + (!beta & b)

I don't know if that has a good mapping to AVX2 intrinsics but I would suspect it does.

Max Langhof
  • 23,383
  • 5
  • 39
  • 72