
What I want to do is:

  1. Multiply the input floating point number by a fixed factor.
  2. Convert the results to 8-bit signed chars.

Note that most of the inputs have a small absolute range of values, like [-6, 6], so that the fixed factor can map them to [-127, 127].

I can only target the AVX2 instruction set, so AVX-512 intrinsics like _mm256_cvtepi32_epi8 can't be used. I would like to use _mm256_packs_epi16, but it mixes two inputs together. :(

I also wrote some code that converts 32-bit float to 16-bit int, and it works exactly as I want:

void Quantize(const float* input, __m256i* output, float quant_mult, int num_rows, int width) {
  // input is actually a matrix; num_rows and width are the number of rows and columns
  assert(width % 16 == 0);

  int num_input_chunks = width / 16;

  __m256 avx2_quant_mult = _mm256_set1_ps(quant_mult);

  for (int i = 0; i < num_rows; ++i) {
    const float* input_row = input + i * width;
    __m256i* output_row = output + i * num_input_chunks;
    for (int j = 0; j < num_input_chunks; ++j) {
      const float* x = input_row + j * 16;
      // Process 16 floats at once, since each __m256i can contain 16 16-bit integers.

      __m256 f_0 = _mm256_loadu_ps(x);
      __m256 f_1 = _mm256_loadu_ps(x + 8);

      __m256 m_0 = _mm256_mul_ps(f_0, avx2_quant_mult);
      __m256 m_1 = _mm256_mul_ps(f_1, avx2_quant_mult);

      __m256i i_0 = _mm256_cvtps_epi32(m_0);
      __m256i i_1 = _mm256_cvtps_epi32(m_1);

      *(output_row + j) = _mm256_packs_epi32(i_0, i_1);
    }
  }
}

Any help is welcome, thank you so much!

Peter Cordes
Amor Fati
  • Is truncation ok? Use `_mm256_shuffle_epi8`. Otherwise use `pack(same,same)`, or better pack 4 vectors of floats down to 1 vector of `int8_t` in multiple steps: 2x epi32 and 1x epi16. (and then fix the in-lane ordering with a single `vpermq`). See [SSE - AVX conversion from double to char](https://stackoverflow.com/q/30853773) for an example using 128-bit `epi32` -> `epi8` – Peter Cordes Aug 10 '18 at 04:20
  • Lane-crossing correction is similar to the float->int16 case: [How can I convert a vector of float to short int using avx instructions?](https://stackoverflow.com/q/41228180). Bizzare, there are no hits on SO (other than this) for `_mm256_packs_epi16`, so no exact duplicates of this exist. – Peter Cordes Aug 10 '18 at 04:27
  • @PeterCordes Truncation is fine. BTW, can you tell me which solution is the fastest? Is throughput the deciding metric? Thanks! – Amor Fati Aug 10 '18 at 04:31
  • You have multiple vectors of input floats, so 2x vpackssdw + 1x vpacksswb + 1x vpermd to produce 1 wide vector from 4 input vectors is better than 4x vpshufb + 4x vpermd + 4x stores. – Peter Cordes Aug 10 '18 at 04:33

2 Answers


For good throughput with multiple source vectors, it's a good thing that _mm256_packs_epi16 has 2 input vectors instead of producing a narrower output. (AVX512 _mm256_cvtepi32_epi8 isn't necessarily the most efficient way to do things, because the version with a memory destination decodes to multiple uops, or the regular version gives you multiple small outputs that need to be stored separately.)

Or are you complaining about how it operates in-lane? Yes, that's annoying, but _mm256_packs_epi32 does the same thing. If it's ok for your outputs to have interleaved groups of data there, do the same thing here, too.

Your best bet is to combine 4 vectors down to 1, in 2 steps of in-lane packing (because there's no lane-crossing pack). Then use one lane-crossing shuffle to fix it up.

#include <immintrin.h>
// loads 128 bytes = 32 floats
// converts and packs with signed saturation to 32 int8_t
__m256i pack_float_int8(const float*p) {
    __m256i a = _mm256_cvtps_epi32(_mm256_loadu_ps(p));
    __m256i b = _mm256_cvtps_epi32(_mm256_loadu_ps(p+8));
    __m256i c = _mm256_cvtps_epi32(_mm256_loadu_ps(p+16));
    __m256i d = _mm256_cvtps_epi32(_mm256_loadu_ps(p+24));
    __m256i ab = _mm256_packs_epi32(a,b);        // 16x int16_t
    __m256i cd = _mm256_packs_epi32(c,d);
    __m256i abcd = _mm256_packs_epi16(ab, cd);   // 32x int8_t
    // packed to one vector, but in [ a_lo, b_lo, c_lo, d_lo | a_hi, b_hi, c_hi, d_hi ] order
    // if you can deal with that in-memory format (e.g. for later in-lane unpack), great, you're done

    // but if you need sequential order, then vpermd:
    __m256i lanefix = _mm256_permutevar8x32_epi32(abcd, _mm256_setr_epi32(0,4, 1,5, 2,6, 3,7));
    return lanefix;
}

(Compiles nicely on the Godbolt compiler explorer).

Call this in a loop and _mm256_store_si256 the resulting vector.


(For uint8_t unsigned destination, use _mm256_packus_epi16 for the 16->8 step and keep everything else the same. We still use signed 32->16 packing, because 16 -> u8 vpackuswb packing still takes its epi16 input as signed. You need -1 to be treated as -1, not +0xFFFF, for unsigned saturation to clamp it to 0.)


With 4 total shuffles per 256-bit store, 1 shuffle per clock throughput will be the bottleneck on Intel CPUs. You should get a throughput of one float vector per clock, bottlenecked on port 5. (https://agner.org/optimize/). Or maybe bottlenecked on memory bandwidth if data isn't hot in L2.


If you only have a single vector to do, you could consider using _mm256_shuffle_epi8 to put the low byte of each epi32 element into the low 32 bits of each lane, then _mm256_permutevar8x32_epi32 for lane-crossing.

Another single-vector alternative (good on Ryzen) is extracti128 + 128-bit packssdw + packsswb. But that's still only good if you're just doing a single vector. (Still on Ryzen, you'll want to work in 128-bit vectors to avoid extra lane-crossing shuffles, because Ryzen splits every 256-bit instruction into (at least) 2 128-bit uops.)


Peter Cordes

Please check the IEEE 754 standard's format for storing float values. First understand how float and double are stored in memory; then you will know how to convert a float or double to a char. It is quite simple.

new_learner
  • It's a question about SIMD and AVX. – XQY Aug 10 '18 at 07:19
  • x86 has a machine instruction to convert `float` to integer (in fact it has multiple, for scalar vs. packed, and legacy x87). Doing it yourself with bit-manipulation would be slower than my answer, which converts 8 `float`s per core clock cycle on Haswell or Skylake. IDK if you're talking about printing a `float` to a decimal string, but this question is about converting them to `int8_t`. For converting to a decimal string, yes, you normally do want to pick apart the exponent and significand. – Peter Cordes Aug 10 '18 at 07:40
  • I don't know of such an instruction; I've just started learning these things, which is why, with my knowledge at this stage, I posted this answer, and I guarantee that it will definitely work. – new_learner Aug 10 '18 at 11:03
  • When you posted it, the OP had already replied to my answer to confirm that *it* definitely works. If you're just learning, I suggest you read my answer and follow the links in it (including to Intel's intrinsics guide), and only post your own answer if you're confident yours is an improvement. And look at https://stackoverflow.com/tags/sse/info for some intro-to-SIMD stuff. – Peter Cordes Aug 11 '18 at 01:06