0

I have a __m256i vector:

static char __attribute__((aligned(32))) str[32] = "Hello@@, This is my text !!!";
    __m256i vec_str=_mm256_load_si256((const __m256i*) str);

Now, based on a 32-bit integer bit-maks, I want to remove some characters from this str (vec_str) and move the later characters backward to fill the gap. For example, i want to remove @@, and !!! and this is my bit-mask (bit 1 => keep | bit 0 => remove)

int mask=0b11110001111111111111111100011111;

In fact, I want to delete the bytes where their bits (in bit-mask) are unset (0), copying later bytes to close the gap. What I expected:

vec_str="Hello This is my text";

I thought about indirect ways like separating the bit-mask into 2 16-bit masks and creating a table with 65536x16 elements for vpshufb (_mm256_shuffle_epi8) then using the table 2 times with memory-copy .... but I think this way is too heavy (too long). Is there any better direct way or a way without a copy in memory?

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
user216085
  • 69
  • 2
  • 8
  • 2
    Not much better; AVX-512 `vpcompressb` does exactly what you want, and it's hard to replicate that functionality with other instructions. For 8x 32-bit elements, it's possible to use `pext` and construct a shuffle vector for `vpermd` ([AVX2 what is the most efficient way to pack left based on a mask?](https://stackoverflow.com/q/36932240)), but with 32 elements that's harder. 16x 4-bit indices is 64 bits so there's maybe some hope of doing it for 16 bytes at a time, creating a pshufb mask, with some extra unpacking work. (Without AVX-512 `vpermb`, you can't do all 32 bytes at once anyway.) – Peter Cordes Jan 12 '23 at 07:24
  • 1
    BTW, `alignas(32)` is the portable C++11 / C11 way to do what you're doing with GNU C `__attribute__((aligned((32)))`. – Peter Cordes Jan 12 '23 at 07:35
  • 1
    BTW, "not much better" isn't really right. You're correct that a huge array of `__m128i shuffles[65536]` would be a performance disaster most of the time, except maybe in a microbenchmark where commonly used shuffles could stay hot in L1d or L2 cache because the program isn't doing anything else. But it's hard to do it fast. – Peter Cordes Jan 12 '23 at 12:59
  • Possible duplicate: [Efficient sse shuffle mask generation for left-packing byte elements](https://stackoverflow.com/q/45506309) - 8 elements at a time for a table of 256x 16 bytes can be viable and still better than scalar. And the answers there propose ways to pack even better, trading ALU work vs. cache footprint. And yeah, Aki's answer uses `pext` like I'd initially thought of. If there's anything unique to this question that I haven't thought of, and someone has an answer they want to post here instead of on that earlier question, vote to reopen and let me know; I can reopen. – Peter Cordes Jan 12 '23 at 12:59
  • If the control mask is **required** to be a bitmap (presumably for post-processing using bit-twiddling) then take a look at how [simdjson](https://github.com/simdjson/simdjson)::minify works. – aqrit Jan 12 '23 at 20:59

0 Answers0