Is there a better way to load an unsigned char array into shorts using SSE? For example:

unsigned char foo1[16];

__m128i foo2 = _mm_loadu_si128((__m128i*)foo1);

I want foo2 to hold the elements as short int (16-bit) values.

Paul R

1 Answer

It's not completely clear what you want.

But if you want one short value per input byte, 16 shorts need two SSE registers, so you probably need something like this (untested):

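#include <emmintrin.h>  // SSE2 intrinsics

// MSVC alignment syntax; on GCC/Clang use __attribute__(( aligned( 16 ) )) instead.
__declspec( align( 16 ) ) unsigned char foo1[ 16 ];
// Fill your array with data

// Aligned load; use _mm_loadu_si128 if foo1 might not be 16-byte aligned.
const __m128i src = _mm_load_si128( ( const __m128i* )foo1 );
const __m128i zero = _mm_setzero_si128();
// Interleaving with zero bytes zero-extends each unsigned byte to 16 bits.
const __m128i lower = _mm_unpacklo_epi8( src, zero );   // First 8 short values
const __m128i higher = _mm_unpackhi_epi8( src, zero );  // Last 8 short values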
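
To show how the unpack-with-zero approach extends to a whole buffer, here is a minimal sketch. The helper name `widen_u8_to_u16` and the scalar tail loop are illustrative assumptions, not part of the answer above; it uses unaligned loads and stores so it makes no alignment demands on the caller.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: zero-extend n unsigned bytes from src into
   n 16-bit values in dst. src and dst must not overlap. */
static void widen_u8_to_u16(const unsigned char *src, unsigned short *dst, size_t n)
{
    const __m128i zero = _mm_setzero_si128();
    size_t i = 0;

    /* Main loop: 16 input bytes become two vectors of 8 shorts each. */
    for (; i + 16 <= n; i += 16) {
        __m128i v  = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i lo = _mm_unpacklo_epi8(v, zero);  /* bytes 0..7  -> 8 shorts */
        __m128i hi = _mm_unpackhi_epi8(v, zero);  /* bytes 8..15 -> 8 shorts */
        _mm_storeu_si128((__m128i *)(dst + i), lo);
        _mm_storeu_si128((__m128i *)(dst + i + 8), hi);
    }

    /* Scalar tail for the remaining 0..15 bytes. */
    for (; i < n; ++i)
        dst[i] = src[i];
}
```

Because the high byte of every 16-bit lane comes from the zero vector, this is only correct for unsigned input; signed bytes would need sign extension instead.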
Soonts
  • SSE4.1 `pmovzxbw` to load 8B at a time is also an option, but it's [hard to safely get compilers to use it directly as a load with intrinsics](http://stackoverflow.com/a/34280492/224132). Unpacking low and high halves with zero works well for unsigned data. – Peter Cordes Apr 27 '16 at 02:37
  • @PeterCordes IMO unpacking with zeros should be slightly faster. Dual-channel RAM transfers data in 128-bit batches anyway, and XMM registers are faster to access than even L1 cache. – Soonts Apr 27 '16 at 04:21
  • On Haswell, you bottleneck on the shuffle port either way, at one unpacked result per cycle. Both loads come from the same cache line. SnB-family CPUs can sustain two loads per clock. SnB/IvB should be able to sustain two `pmovzxbw xmm, [mem]` per clock since they have two 128b shuffle units, especially if the memory address doesn't use an indexed addressing mode (so it can micro-fuse). Anyway, 2x `pmovzx` from memory will be better if your data is 8B-aligned but not 16B-aligned. It's also fewer fused-domain uops (no separate load), and you don't need a zeroed vector. – Peter Cordes Apr 27 '16 at 04:28
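
For comparison, the `pmovzxbw` route mentioned in the comments can be sketched with the SSE4.1 intrinsic `_mm_cvtepu8_epi16`. The helper name `widen_u8_to_u16_sse41` and the `TARGET_SSE41` macro are illustrative assumptions. The explicit 8-byte load via `_mm_loadl_epi64` is one common way to feed the intrinsic; whether a given compiler folds it into a single `pmovzxbw xmm, [mem]` is not guaranteed.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_cvtepu8_epi16 */
#include <stddef.h>

/* Hypothetical macro: lets GCC/Clang compile this one function for SSE4.1
   without a global -msse4.1 flag. The CPU must still support SSE4.1. */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

/* Hypothetical helper: zero-extend n unsigned bytes into n 16-bit values,
   8 bytes per step. src and dst must not overlap. */
TARGET_SSE41
static void widen_u8_to_u16_sse41(const unsigned char *src, unsigned short *dst, size_t n)
{
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        /* Load exactly 8 bytes into the low half of a vector. */
        __m128i v = _mm_loadl_epi64((const __m128i *)(src + i));
        /* pmovzxbw: zero-extend the low 8 bytes to 8 shorts. */
        __m128i w = _mm_cvtepu8_epi16(v);
        _mm_storeu_si128((__m128i *)(dst + i), w);
    }

    /* Scalar tail for the remaining 0..7 bytes. */
    for (; i < n; ++i)
        dst[i] = src[i];
}
```

This version needs no zeroed vector and only reads 8 bytes per load, which fits the case of data that is 8B-aligned but not 16B-aligned.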