
I was wondering what the best way is to store a 256-bit AVX vector into four 64-bit unsigned long integers. According to the intrinsics listed at https://software.intel.com/sites/landingpage/IntrinsicsGuide/ I could only figure out how to do this using maskstore (code below). But is that the best way, or are there other methods?

#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h> // rand()

int main() {

    unsigned long long int i,j;
    unsigned long long int bit[32][4];//256 bit random numbers
    unsigned long long int bit_out[32][4];//256 bit random numbers for test

    for(i=0;i<32;i++){ //load with 64 bit random integers
        for(j=0;j<4;j++){
            bit[i][j]=rand();
            bit[i][j]=bit[i][j]<<32 | rand();
        }
    }

//--------------------load masking-------------------------
    __m256i v_bit[32];
    __m256i mask;
    unsigned long long int mask_ar[4];
    mask_ar[0]=~(0ULL);mask_ar[1]=~(0ULL);mask_ar[2]=~(0ULL);mask_ar[3]=~(0ULL); //ULL so all 64 bits (including the sign bit maskstore checks) are set
    mask = _mm256_loadu_si256 ((__m256i const *)mask_ar);
//--------------------load masking ends-------------------------

//--------------------------load the vectors-------------------
    for(i=0;i<32;i++){

        v_bit[i]=_mm256_loadu_si256 ((__m256i const *)bit[i]);

    }
//--------------------------load the vectors ends-------------------

//--------------------------extract from the vectors-------------------
    for(i=0;i<32;i++){

        _mm256_maskstore_epi64 (bit_out[i], mask, v_bit[i]);
    }
//--------------------------extract from the vectors end-------------------

    for(i=0;i<32;i++){ //verify the stored values
        for(j=0;j<4;j++){
            if(bit[i][j]!=bit_out[i][j])
                printf("----ERROR----\n");
        }
    }

  return 0;
}
Amiri
Rick
  • Best way is not to. `unsigned long` is not guaranteed to have 64 bits. If you need a specific bitwidth (and encoding), use fixed-width types from `stdint.h`. – too honest for this site Mar 13 '17 at 12:30
  • Maybe you should have a look at the `extract`, `set` and `insert` intrinsics. I have no idea what you are trying to do. – Christoph Diegelmann Mar 13 '17 at 12:44
  • @Christoph I just want to extract a 256-bit vector into 4 64-bit integers. I didn't find the intrinsics you mentioned on the above-mentioned page. – Rick Mar 13 '17 at 13:36
  • If the destination 64 bit ints are contiguous then just use `_mm256_storeu_si256`. – Paul R Mar 13 '17 at 13:37
  • In C11, use `_Alignas(32) unsigned long long int bit[32][4];` to get the compiler to align the stack memory for your array. This helps with performance even if you still use `_mm256_storeu_ps`. – Peter Cordes Jul 14 '17 at 04:16
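
Putting those comment suggestions together, a minimal sketch (assuming C11 and AVX support; store4x64 and out are placeholder names, not anything from the question):

#include <immintrin.h>
#include <stdint.h>

_Alignas(32) uint64_t out[4];               // 32-byte aligned, so an aligned store would also work

void store4x64(uint64_t dst[4], __m256i v)  // store one 256-bit vector into four contiguous 64-bit ints
{
    _mm256_storeu_si256((__m256i *)dst, v); // fine whether or not dst is aligned
}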

1 Answer


As others said in the comments, you do not need to use a masked store in this case. The following loop produces no errors in your program:

for(i=0;i<32;i++){
   _mm256_storeu_si256 ((__m256i *) bit_out[i], v_bit[i]); // destination pointer of a store must not be const

}

So the intrinsic you are looking for is _mm256_storeu_si256; this instruction stores a __m256i vector to an unaligned address. If your data is aligned you can use _mm256_store_si256 instead. To see your vector's values you can use this function:

#include <stdalign.h>
#include <stdio.h>
#include <immintrin.h>

alignas(32) unsigned long long int tempu64[4]; // 32-byte aligned scratch buffer

void printVecu64(__m256i vec)
{
    _mm256_store_si256((__m256i *)&tempu64[0], vec); // aligned store is safe: tempu64 is 32-byte aligned
    printf("[0]= %llu, [1]=%llu, [2]=%llu, [3]=%llu \n\n",
           tempu64[0], tempu64[1], tempu64[2], tempu64[3]); // %llu for unsigned long long
}
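
For example, a quick illustrative check (the values are arbitrary; note that _mm256_set_epi64x takes its arguments from the highest element down to the lowest):

__m256i v = _mm256_set_epi64x(4, 3, 2, 1);   // arguments are element 3 down to element 0
printVecu64(v);                              // prints [0]= 1, [1]=2, [2]=3, [3]=4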

_mm256_maskstore_epi64 lets you choose which elements you are going to store to memory. This instruction is useful when you want to store only some elements of a vector and leave the other memory locations unchanged.
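
For example, a small sketch (dst and vec are placeholder names; vec is any __m256i) that writes only elements 0 and 2 and leaves the other two 64-bit slots untouched; the intrinsic looks at the sign bit of each mask element:

long long dst[4] = {0, 0, 0, 0};                      // destination in memory
__m256i mask = _mm256_set_epi64x(0, ~0LL, 0, ~0LL);   // arguments are high-to-low: store elements 2 and 0 only
_mm256_maskstore_epi64(dst, mask, vec);               // dst[1] and dst[3] keep their old values (here 0)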

I was reading the Intel 64 and IA-32 Architectures Optimization Reference Manual (248966-032), 2016, page 410, and interestingly found out that an unaligned store is still a performance killer:

11.6.3 Prefer Aligned Stores Over Aligned Loads

There are cases where it is possible to align only a subset of the processed data buffers. In these cases, aligning data buffers used for store operations usually yields better performance than aligning data buffers used for load operations. Unaligned stores are likely to cause greater performance degradation than unaligned loads, since there is a very high penalty on stores to a split cache-line that crosses pages. This penalty is estimated at 150 cycles. Loads that cross a page boundary are executed at retirement. In Example 11-12, unaligned store address can affect SAXPY performance for 3 unaligned addresses to about one quarter of the aligned case.

I shared this here because some people said there are no differences between aligned and unaligned stores except in debugging!
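
If you want the aligned-store path, one possible way to get a 32-byte-aligned buffer is C11 aligned_alloc (shown here as an assumption about your environment; _mm_malloc or an alignas(32) array are alternatives, and vec is again a placeholder __m256i):

#include <stdlib.h>   // aligned_alloc (C11)

unsigned long long *buf = aligned_alloc(32, 4 * sizeof(unsigned long long)); // 32 bytes, a multiple of the 32-byte alignment
if (buf) {
    _mm256_store_si256((__m256i *)buf, vec);   // aligned store is safe on this buffer
    free(buf);
}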

Peter Cordes
Amiri
  • `_mm256_store_si256` has no advantage over `_mm256_storeu_si256`, except maybe for debugging: if you expect memory to be aligned and it's not, then `_mm256_store_si256` would crash your code. – Z boson Mar 16 '17 at 09:22
  • @Zboson sadly ICC emits an unaligned load even if you use `_mm256_store_si256`. This has caused me some grey hairs when the code suddenly crashed with other compilers. – Christoph Diegelmann Mar 20 '17 at 11:56
  • @Christoph, I should have said the instructions instead of the intrinsics, but using another compiler can also be useful for debugging. – Z boson Mar 20 '17 at 12:10
  • Fun fact: Skylake vastly reduces the 4k-split penalty, down to barely more than the cache-line split penalty. (I think this is correct for stores as well as loads). – Peter Cordes Aug 21 '17 at 02:18
  • Yes, I never experienced penalties for unaligned loads on SKL. But processor vendors are some killers... – Amiri Aug 21 '17 at 16:46
  • Update on this: `_mm256_storeu_si256` vs. `_mm256_store_si256` can make your code slower with GCC, even if your data is aligned. [Why doesn't gcc resolve \_mm256\_loadu\_pd as single vmovupd?](https://stackoverflow.com/q/52626726). If your data is aligned, there's no downside to using unaligned load/store *instructions* on any CPU that supports AVX, but GCC with some tuning options will split unaligned 256-bit loads and/or stores into `vmovups xmm` / `vextractf128` (not a single insn), hurting performance on aligned arrays. That's avoided on Intel Haswell and later by using `-march=native` – Peter Cordes Sep 23 '20 at 16:02