
I am trying to use SSE2 to unpack text with zeros, and extend that to AVX2. Here's what I mean:

Suppose you have some text like this: abcd

I'm trying to use SSE2 to unpack abcd into a\0b\0c\0d\0, where the \0's are zero bytes. This is of course being applied to 16 characters at a time instead of 4.

I was able to do that using this code (ignore the C-Style casts):

__m128i chunk = _mm_loadu_si128((__m128i const*) src); // Load 16 bytes from memory

__m128i half = _mm_unpacklo_epi8(chunk, _mm_setzero_si128()); // Unpack lower 8 bytes with zeros
_mm_storeu_si128((__m128i*) dst, half); // Write to destination

half = _mm_unpackhi_epi8(chunk, _mm_setzero_si128()); // Unpack higher 8 bytes with zeros
_mm_storeu_si128((__m128i*) (dst + 16), half); // Write to destination

This works great, but I'm trying to convert the code into AVX2, so I can process 32 bytes at a time. However, I'm having trouble with unpacking the low bytes.

Here is the code I'm using for AVX2:

__m256i chunk = _mm256_loadu_si256((__m256i const*) src); // Load 32 bytes from memory

__m256i half = _mm256_unpacklo_epi8(chunk, _mm256_setzero_si256()); // Unpack lower 16 bytes with zeros
_mm256_storeu_si256((__m256i*) dst, half); // Write to destination

half = _mm256_unpackhi_epi8(chunk, _mm256_setzero_si256()); // Unpack higher 16 bytes with zeros
_mm256_storeu_si256((__m256i*) (dst + 32), half); // Write to destination

The problem is, the _mm256_unpacklo_epi8 instruction seems to be skipping 8 bytes for every 8 bytes it converts. For example this text (the "fr" at the end is intended):

Permission is hereby granted, fr

Gets converted into

Permissireby graon is hented, fr

For every 8 bytes that _mm256_unpacklo_epi8 processes, 8 bytes get skipped.

What am I doing wrong here? Any help would be greatly appreciated.

  • `_mm256_unpacklo_epi8` operates in-lane. Either use `vpmovzxbw` (`_mm256_cvtepu8_epi16`) or do a `vpermq` to fix up the unpack results. (Or interleave your loads with `_mm_loadu_si128` / `_mm256_insertf128_si256`.) Sketches of the first two approaches follow this comment thread. – Peter Cordes Aug 17 '21 at 19:10
  • @PeterCordes But `_mm256_cvtepu8_epi16` takes a `__m128i` instead of a `__m256i`. Do I do two loads instead? – whatisgoingon Aug 17 '21 at 19:13
  • Yeah, load in 128-bit chunks. Modern CPUs have twice as many load execution units as store execution units. (Except Ice Lake, which adds a 2nd store unit.) And 1 load per store is a fine balance. – Peter Cordes Aug 17 '21 at 19:14
  • @PeterCordes Oh ok. I'm gonna try this out, thanks! – whatisgoingon Aug 17 '21 at 19:15
  • I think this kind of question has come up before so I was looking for a duplicate. [unexpected _mm256_shuffle_epi with __256i vectors](https://stackoverflow.com/q/46582438) covers why this happens, but doesn't mention the good workarounds. `ASCIIrev32B` in [this answer](https://stackoverflow.com/a/61181913/224132) shows a variation of the 2x `_mm_loadu_si128` => vinserti128 setup for an in-lane shuffle. – Peter Cordes Aug 17 '21 at 19:23
  • @PeterCordes Thanks Peter for all the help! – whatisgoingon Aug 18 '21 at 17:08
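
To make the comment thread concrete, here is a minimal sketch of the first two fixes described above, assuming `src` and `dst` are `char` pointers as in the question (the function names are illustrative, not from the thread):

#include <immintrin.h>

// Fix 1: load 16 bytes at a time and widen with vpmovzxbw
// (_mm256_cvtepu8_epi16), which zero-extends across the whole
// 256-bit register instead of working in-lane.
void widen32_loads(const char* src, char* dst)
{
    __m128i lo = _mm_loadu_si128((__m128i const*) src);        // bytes 0..15
    __m128i hi = _mm_loadu_si128((__m128i const*) (src + 16)); // bytes 16..31

    _mm256_storeu_si256((__m256i*) dst,        _mm256_cvtepu8_epi16(lo));
    _mm256_storeu_si256((__m256i*) (dst + 32), _mm256_cvtepu8_epi16(hi));
}

// Fix 2: keep the single 256-bit load, but pre-shuffle the 64-bit
// quarters with vpermq: [q0 q1 q2 q3] -> [q0 q2 q1 q3], so the in-lane
// unpacklo sees q0/q1 and unpackhi sees q2/q3, producing contiguous output.
void widen32_vpermq(const char* src, char* dst)
{
    __m256i chunk = _mm256_loadu_si256((__m256i const*) src);
    chunk = _mm256_permute4x64_epi64(chunk, _MM_SHUFFLE(3, 1, 2, 0));

    __m256i zero = _mm256_setzero_si256();
    _mm256_storeu_si256((__m256i*) dst,        _mm256_unpacklo_epi8(chunk, zero));
    _mm256_storeu_si256((__m256i*) (dst + 32), _mm256_unpackhi_epi8(chunk, zero));
}

The first variant is the one the comments converge on; as noted above, one load per store is a fine balance on most CPUs.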

1 Answer


As I can see, the right answer has already been given by @PeterCordes. Nevertheless, I want to supplement it with a small helper function:

template <int part> inline __m256i Cvt8uTo16u(__m256i a)
{
    // Extract the chosen 128-bit half (part = 0 or 1) and zero-extend
    // its 16 bytes to 16-bit elements (vpmovzxbw).
    return _mm256_cvtepu8_epi16(_mm256_extractf128_si256(a, part));
}
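
For example, the AVX2 code from the question could use it like this (a minimal usage sketch, reusing `src` and `dst` from the question):

__m256i chunk = _mm256_loadu_si256((__m256i const*) src); // Load 32 bytes from memory

_mm256_storeu_si256((__m256i*) dst, Cvt8uTo16u<0>(chunk));        // Widen bytes 0..15
_mm256_storeu_si256((__m256i*) (dst + 32), Cvt8uTo16u<1>(chunk)); // Widen bytes 16..31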
ErmIg
  • Have you confirmed compilers are optimizing away `_mm256_extractf128_si256( a, 0 )`? If they aren’t, easy to do manually, with `if constexpr( part == 0 )` and `_mm256_castsi256_si128` – Soonts Aug 20 '21 at 00:14
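
For reference, a minimal sketch of the variant Soonts suggests (my code, not theirs; requires C++17 for `if constexpr`):

template <int part> inline __m256i Cvt8uTo16u(__m256i a)
{
    if constexpr (part == 0)
        return _mm256_cvtepu8_epi16(_mm256_castsi256_si128(a)); // low half: the cast is a no-op
    else
        return _mm256_cvtepu8_epi16(_mm256_extractf128_si256(a, 1)); // high half: needs an actual extract instruction
}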