There is a solution that shows how to find the indexes of non-zero bytes in a 128-bit vector - https://stackoverflow.com/a/41959079/3648510.

Another solution (`arr2ind_pext`) demonstrates how to find the non-zero bytes in a 256-bit vector, but it returns the indexes as 4-byte integers - https://stackoverflow.com/a/41958528/3648510.

My original intention was to modify the `arr2ind_pext` solution to return 8-bit instead of 32-bit indexes.

But now I think 32-bit indexes might be OK; what I really want is a solution that finds the indexes in a 256-bit vector (or preferably 512 bits in two iterations) as fast as possible.

My current solution, based on `arr2ind_pext`, is here:

#include <immintrin.h>   // AVX2, BMI2 (_pext_u64) and POPCNT intrinsics
#include <cstdint>

unsigned nonzeros(const __m256i _in, __m256i& _ivec, unsigned* _out)
{
    uint64_t cntr_const = 0xFEDCBA9876543210;
    __m256i  shft       = _mm256_set_epi64x(0x04,0x00,0x04,0x00);
    __m256i  vmsk       = _mm256_set1_epi8(0x0F);
    __m256i  shf_lo     = _mm256_set_epi8(
        0x80, 0x80, 0x80, 0x0B,  0x80, 0x80, 0x80, 0x03,  0x80, 0x80, 0x80, 0x0A,  0x80, 0x80, 0x80, 0x02,
        0x80, 0x80, 0x80, 0x09,  0x80, 0x80, 0x80, 0x01,  0x80, 0x80, 0x80, 0x08,  0x80, 0x80, 0x80, 0x00
    );
    __m256i  shf_hi     = _mm256_set_epi8(
        0x80, 0x80, 0x80, 0x0F,  0x80, 0x80, 0x80, 0x07,  0x80, 0x80, 0x80, 0x0E,  0x80, 0x80, 0x80, 0x06,
        0x80, 0x80, 0x80, 0x0D,  0x80, 0x80, 0x80, 0x05,  0x80, 0x80, 0x80, 0x0C,  0x80, 0x80, 0x80, 0x04
    );
    __m256i  pshufbcnst = _mm256_set_epi8(
        0x80, 0x80, 0x80, 0x80,  0x80, 0x80, 0x80, 0x80,  0x1E, 0x1C, 0x1A, 0x18,  0x16, 0x14, 0x12, 0x10,
        0x80, 0x80, 0x80, 0x80,  0x80, 0x80, 0x80, 0x80,  0x0E, 0x0C, 0x0A, 0x08,  0x06, 0x04, 0x02, 0x00
    );

    __m256i  msk        = _mm256_cmpeq_epi8(_in, _mm256_setzero_si256()); // 32x8-bit mask: 0xFF where the input byte is zero
             msk        = _mm256_srli_epi64(msk, 4);                      // Merge neighbouring byte masks into nibble pairs within each byte
             msk        = _mm256_shuffle_epi8(msk, pshufbcnst);           // Gather the even bytes: pack the 32x8-bit mask to a 32x4-bit mask
             msk        = _mm256_xor_si256(msk, _mm256_set1_epi8(-1));    // Invert: a nibble of 0xF now marks a non-zero byte
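    // Worked example for one 64-bit qword (8 input bytes, listed low to high),
    // assuming a hypothetical input whose bytes 0, 2 and 3 are zero:
    //   cmpeq : FF 00 FF FF 00 00 00 00
    //   srli 4: 0F F0 FF 0F 00 00 00 00   (each byte now holds two neighbouring mask nibbles)
    //   pshufb: 0F FF 00 00               (even bytes gathered -> 8 nibbles: F,0,F,F,0,0,0,0)
    //   xor   : F0 00 FF FF               (nibbles 0,F,0,0,F,F,F,F: set where the byte is non-zero)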

    uint64_t m64_0 = _mm256_extract_epi64(msk, 0);                 // 16-nibble mask for bytes 0..15 (low 128-bit lane)
    uint64_t m64_1 = _mm256_extract_epi64(msk, 2);                 // 16-nibble mask for bytes 16..31 (high 128-bit lane)
    unsigned m64_count_0 = _mm_popcnt_u64(m64_0) >> 2;             // Number of non-zero bytes among bytes 0..15
    unsigned m64_count_1 = _mm_popcnt_u64(m64_1) >> 2;             // Number of non-zero bytes among bytes 16..31
    unsigned* out_0 = &_out[0];
    unsigned* out_1 = &_out[m64_count_0];                          // High-lane indexes go right after the low-lane ones

    auto f = [&](uint64_t msk64, unsigned* __restrict__  _out)
    {
        uint64_t cntr       = _pext_u64(cntr_const, msk64);           // parallel bits extract. cntr contains p 4-bit integers. The 16 4-bit integers in cntr_const are shuffled to the p 4-bit integers that we want
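        // Worked example (continuing the hypothetical input above): if the nibble
        // mask marks bytes 1, 4, 5, 6 and 7 of this 16-byte half, then
        // cntr = 0x76541 - the five indexes packed as 4-bit values, lowest first.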

        // Unpack p 4-bit integers to p 32-bit integers
        __m256i  cntr256    = _mm256_set1_epi64x(cntr);
                 cntr256    = _mm256_srlv_epi64(cntr256, shft);
                 cntr256    = _mm256_and_si256(cntr256, vmsk);
        __m256i  cntr256_lo = _mm256_shuffle_epi8(cntr256, shf_lo);
        __m256i  cntr256_hi = _mm256_shuffle_epi8(cntr256, shf_hi);
                 cntr256_lo = _mm256_add_epi8(_ivec, cntr256_lo);  // Add the running base; the byte-wise add is exact because the base is a multiple of 16 (the low byte never carries)
                 cntr256_hi = _mm256_add_epi8(_ivec, cntr256_hi);

        _mm256_storeu_si256((__m256i *)&_out[0], cntr256_lo);        // Note that the stores of iteration i and i+16 may overlap
        _mm256_storeu_si256((__m256i *)&_out[8], cntr256_hi);        // _out has to be large enough to avoid segfaults: at most 16 integers are written beyond what is strictly necessary
    };

    f(m64_0, out_0);

    _ivec = _mm256_add_epi32(_ivec, _mm256_set1_epi32(16));       // Advance the index base for the high 16 bytes
    f(m64_1, out_1);

    return m64_count_0 + m64_count_1;
}
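
For reference, here is a minimal test driver (a sketch; the data values are arbitrary). The function needs AVX2, BMI2 and POPCNT, e.g. `g++ -O2 -mavx2 -mbmi2`. The output buffer is over-allocated because, as noted above, each store may write up to 16 integers past the last valid index:

#include <cstdio>

// nonzeros() and the includes from above are assumed to be in scope.
int main()
{
    alignas(32) uint8_t data[32] = {0};          // non-zero bytes at positions 0, 1, 3 and 30
    data[0] = 4; data[1] = 2; data[3] = 7; data[30] = 5;

    unsigned out[32 + 16];                       // 16 entries of slack for the overlapping stores
    __m256i  base = _mm256_setzero_si256();      // running index base; nonzeros() advances it by 16 internally
    __m256i  in   = _mm256_load_si256((const __m256i*)data);

    unsigned n = nonzeros(in, base, out);        // n == 4
    for (unsigned i = 0; i < n; ++i)
        std::printf("%u ", out[i]);              // prints: 0 1 3 30
    std::printf("\n");
    return 0;
}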
  • Can't you just pack the 32 bit indices from `arr2ind_pext` down to 8 bits? (a sketch of this appears after these comments) – Paul R Aug 14 '17 at 12:15
  • Sure I can, but I want to know whether I can do more work in a single iteration, given that the input vector is only 256 bits long. – Alexey R. Aug 14 '17 at 13:19
  • Why not at least start with that method and then see if it really is a performance bottleneck? That might also give you insight into potential further optimisations. – Paul R Aug 14 '17 at 19:37
  • I am not sure if I did understand your question entirely. Is your question to extend [this answer](https://stackoverflow.com/a/41959079/3648510) from 128 bit to 256 bit? So, for example: if the input AVX2 register is `[in_31,in_30,...,in_1,in_0]=[0,5,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0, 4,0,0,0, 7,0,2,4]`, then do you expect the output to be `[out_31,out_30,...,out_1,out_0]=[-1,-1,...,-1, -1,-1,-1,30, 7,3,1,0]` ? Note that `arr2ind_pext`, in the second answer that you are referring to, also processes only 128 bit of data per iteration. – wim Aug 16 '17 at 12:39
  • A 256 bit version of function `nonz_index` is possible, but the extension from 128 bit to 256 bit is not trivial. Simply replacing `_mm` by `_mm256` doesn't work because the `_mm256` intrinsics don't cross the 128 bit lane boundaries. Moreover, the line `indx64 = _pext_u64(indx_const,msk64);` needs some attention to make it work for 32 indices instead of 16. It won't be too difficult to work around this, but that will take a couple of extra instructions. – wim Aug 16 '17 at 13:06
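
Regarding Paul R's packing suggestion: since `_pext_u64` already produces the indexes as packed 4-bit values, they could be widened straight to bytes instead of dwords. A rough, untested sketch of that unpack step, which would replace the `shf_lo`/`shf_hi` expansion inside `f` (`base8` and `out8` are hypothetical names for an 8-bit running base vector and a byte output pointer):

__m128i v    = _mm_cvtsi64_si128((long long)cntr);                       // the 16 packed nibbles from _pext_u64
__m128i even = _mm_and_si128(v, _mm_set1_epi8(0x0F));                    // nibbles 0,2,4,... land in bytes 0..7
__m128i odd  = _mm_and_si128(_mm_srli_epi64(v, 4), _mm_set1_epi8(0x0F)); // nibbles 1,3,5,... land in bytes 0..7
__m128i idx8 = _mm_unpacklo_epi8(even, odd);                             // interleave: 16 byte-sized indexes in order
        idx8 = _mm_add_epi8(idx8, base8);                                // add the (hypothetical) 8-bit running base
_mm_storeu_si128((__m128i*)out8, idx8);                                  // as above, may write up to 16 bytes past the last valid index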
