Why does shift right in practice shift left (and vice versa) in Neon and SSE?

Question

(Note, in Neon I am using this data type to avoid dealing with conversions among 16-bit data types)

Why does "shift left" in intrinsics in practice "shift right"?

// Values contained in a
// 141 138 145 147 144 140 147 153 154 147 149 146 155 152 147 152
b = vshlq_n_u32(a,8);
// Values contained in b
// 0 141 138 145 0 144 140 147 0 154 147 149 0 155 152 147
b = vshrq_n_u32(a,8);
// Values contained in b
// 138 145 147 0 140 147 153 0 147 149 146 0 152 147 152 0

I remember finding the same situation when using _mm_slli_si128 (which is different, though; the result after a shift looks like this):

// b = _mm_slli_si128(a,1);
// 0 141 138 145 147 144 140 147 153 154 147 149 146 155 152 147

Is it because of endianness? Will it change from platform to platform?

@BenVoigt Maybe I am using the term "bypass" in the wrong way. If I use the data type described in the link, I can use it as input and output of functions like `vshlq_n_u32`, `vget_low_u8`, `vuzp_u8` ... — Antonio, Mar 23 '15 at 11:34
Maybe you can change your hand-writing order to little endian: write lower-address words on the right-hand side, then it's more obvious. — user3528438, Mar 23 '15 at 11:38
Yes, this is endianness. If you print this as a collection of four unsigned 32-bit values, you will see that the instruction multiplies them by 256, and drops the upper byte (thanks for the correction). — Sergey Kalinichenko, Mar 23 '15 at 11:43
@user3528438 The problem is that in practice I am processing pixels (unsigned char values), and they are stored in that order. — Antonio, Mar 23 '15 at 12:24
@dasblinkenlight So, very important for me, does the behaviour change from platform to platform? I understand that what is constant is that a shift left behaves like a multiplication by 2^N (throwing away what overflows). How can I guarantee the bytes will move in the intended direction? — Antonio, Mar 23 '15 at 12:32
@PaulR It really seems you are implying that then, in this case, endianness doesn't matter — Antonio, Mar 23 '15 at 16:25
Sorry for the confusion - I was reading this question along with your [earlier answer](http://stackoverflow.com/a/29210569/253056) where you seemed to be overly worried that all SIMD operations could be affected by endianness. I was trying to clarify that only cases like this, where you are mixing different element sizes, would be affected by endianness. You do indeed need to be aware of these cases. I'll delete my comments shortly as I realise now that they are confusing when taken out of context. — Paul R, Mar 23 '15 at 16:50

kfsone · Accepted Answer · 2015-03-29T22:27:10.500

You ask "is it because of endianness?", but it's more a case of type abuse. You're making assumptions about the machine's bit ordering across byte/word boundaries, while using non-byte instructions that impose the machine's endianness on the operation (you're using a _u32 instruction, which expects unsigned 32-bit values, not arrays of 8-bit values).

As you say, you are asking it to shift a series of unsigned char values by /asking/ it to shift values in 32 bit units.
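
To make that concrete, here is a minimal portable sketch of what a per-lane 32-bit shift does to 16 bytes in memory. It is not NEON code: `shl_u32_lanes` and `lane_value` are hypothetical helpers written for illustration, and the `memcpy` calls stand in for the vector load and store.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a NEON API): emulates the effect of a per-lane
 * 32-bit left shift on 16 bytes held in memory. Each group of 4 bytes is
 * reinterpreted as one uint32_t in the host's native byte order. */
static void shl_u32_lanes(uint8_t buf[16], unsigned n)
{
    assert(n < 32);                            /* shifting by >= 32 is undefined in C */
    for (size_t lane = 0; lane < 4; ++lane) {
        uint32_t v;
        memcpy(&v, buf + 4 * lane, sizeof v);  /* native-endian load  */
        v <<= n;                               /* arithmetic on the 32-bit value */
        memcpy(buf + 4 * lane, &v, sizeof v);  /* native-endian store */
    }
}

/* Reads one lane back as a native 32-bit value, for inspection. */
static uint32_t lane_value(const uint8_t buf[16], size_t lane)
{
    uint32_t v;
    memcpy(&v, buf + 4 * lane, sizeof v);
    return v;
}
```

Whatever the host, each lane value ends up multiplied by 2^n with the overflow discarded. On a little-endian host, calling `shl_u32_lanes(px, 8)` on the question's pixel bytes leaves `0 141 138 145` in the first four memory positions, which is the "shifted right" picture from the question.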

Unfortunately, you are going to need to put them in architecture order if you want to be able to do an architecture shift on them.

Otherwise you may want to look for a blit or move instruction, but you can't artificially coerce machine types into machine registers without paying architectural costs. Endianness will be just one of your headaches (along with alignment, padding, etc.).
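
As a rough illustration of the "move" alternative, the sketch below shifts bytes in memory order using plain byte copies, so host byte order plays no part. `move_bytes_up` is a hypothetical helper, not an intrinsic; a real SIMD version would use a byte-wise instruction instead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: within every `group`-byte block of `buf`, move the
 * bytes `k` positions toward higher addresses and zero-fill the vacated
 * low-address bytes. It only touches bytes, so endianness is irrelevant. */
static void move_bytes_up(uint8_t *buf, size_t len, size_t group, size_t k)
{
    assert(group > 0 && k <= group && len % group == 0);
    for (size_t off = 0; off < len; off += group) {
        memmove(buf + off + k, buf + off, group - k);  /* overlapping copy */
        memset(buf + off, 0, k);                       /* zero the gap     */
    }
}
```

With `group = 4, k = 1` the question's data becomes `0 141 138 145 0 144 140 147 ...` on any host; with `group = 16, k = 1` the whole 16-byte block moves by one byte, giving the layout shown in the question for `_mm_slli_si128(a,1)`.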

--- Late Edit ---

Fundamentally, you are confusing byte shifts and bit shifts. We consider the most significant bits to be on the "left":

bit number
87654321

hex
8421
00008421

00000001  = 0x01 (small, less significant)
10000000  = 0x80 (large, more significant)

But the values you are shifting are 32-bit words. On a little-endian machine, each subsequent address holds a more significant byte of the value. For a 32-bit word:

bit numbers
                1111111111111111
87654321fedcba0987654321fedcba09

To represent the 32-bit value 0x0001

                1111111111111111
87654321fedcba0987654321fedcba09

00000001000000000000000000000000

To shift it left by 2 positions

00000001000000000000000000000000
     v<
00000100000000000000000000000000

To shift it left by another 8 positions, we have to wrap it to the next address:

00000100000000000000000000000000
      >>>>>>>v
00000000000001000000000000000000

This looks like a right shift if you are thinking in bytes. But we told this little-endian CPU that we were working on a uint32, so that means:

                1111111111111111
87654321fedcba0987654321fedcba09
 byte01  byte02  byte03  byte04
00000001000000000000000000000000 = 0x0001
00000100000000000000000000000000 = 0x0004
00000000000001000000000000000000 = 0x0400

The problem is that this differs from the ordering you expect for an array of 8-bit values, but you told the CPU the values were _u32, so it used its native endianness for the operation.
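
If you want the byte movement to be the same on every host, one option is to assemble each lane explicitly instead of relying on the native layout. The following is a sketch under that assumption; `load_le32`, `store_le32` and `shl_lane_le` are hypothetical names, and the code is scalar C rather than intrinsics.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: build a 32-bit value with the lowest-address byte
 * as the least significant byte, whatever the host byte order is. */
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}

/* Hypothetical helper: write the value back in the same fixed order. */
static void store_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v);
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Shift one 4-byte lane left by n bits using the fixed little-endian view. */
static void shl_lane_le(uint8_t *p, unsigned n)
{
    assert(n < 32);
    store_le32(p, load_le32(p) << n);
}
```

Starting from the bytes `1 0 0 0`, a shift by 2 gives `4 0 0 0` and a further shift by 8 gives `0 4 0 0`, mirroring the 0x01, 0x04, 0x0400 steps in the diagrams above.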

You have a point, but I don't understand how one can avoid such abuse when using the intrinsic [_mm_slli_si128](https://msdn.microsoft.com/en-us/library/34d3k2kt%28v=vs.90%29.aspx)... It's basically designed for abuse! And in what way is my dump code wrong? It prints the bytes in the same order they had when I loaded them from memory... — Antonio, Mar 23 '15 at 16:29
Not quite: it treats the data as a single 128-bit SSE register value, so it's still the same neediness, just with 4 local words of it :) — kfsone, Mar 23 '15 at 16:34
"neediness" was "endianness". It's in my kindle's dictionary but it just can't seem to resist changing it. — kfsone, Mar 23 '15 at 19:28
Again, what do you mean by "your dump code is wrong for this architecture"? — Antonio, Mar 23 '15 at 20:05
It was slightly tongue-in-cheek; I'll edit with an extended explanation. — kfsone, Mar 29 '15 at 22:00
Thanks for the detailed explanation. To complete the answer to my question, the behaviour will change when moving to a big-endian machine, correct? — Antonio, Mar 29 '15 at 22:08
Yep, that's endianness, but the important thing is that you're seeing it because you're asking byte/stream data to be treated as native typed data. — kfsone, Mar 29 '15 at 22:30

score 0 · Answer 2 · edited May 23 '17 at 12:29

The result of these intrinsics seems to depend on system endianness, so I have added a check that will fail the build if we ever port the code to a big-endian system:

#if __BYTE_ORDER__ != __ORDER_LITTLE_ENDIAN__
    #pragma GCC error "Intrinsics used with little endian systems in mind. Start by reviewing all shift operators."
#endif

See checking endianness at compile time.
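
Where those predefined macros are not available, a run-time probe is a possible fallback. This is only a sketch: `host_is_little_endian` and `require_little_endian` are hypothetical helpers, and the check is made when the program runs, not when it is compiled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 when the lowest-address byte of a
 * multi-byte integer holds its least significant bits. */
static int host_is_little_endian(void)
{
    const uint32_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);   /* inspect the byte at the lowest address */
    return first == 1;
}

/* Hypothetical guard: aborts in debug builds on a non-little-endian host. */
static void require_little_endian(void)
{
    assert(host_is_little_endian() && "intrinsics code assumes little endian");
}
```

Calling `require_little_endian()` once at start-up would give a similar safety net to the preprocessor check, at the cost of only failing when the program runs.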

answered Mar 23 '15 at 12:45 by Antonio · edited May 23 '17 at 12:29 by Community

Endianness won't affect the shift instructions: right shift always moves from MSB to LSB, and left shift always moves from LSB to MSB. However, endianness does affect how you load data into a register: in big endian mode, a vector load puts lower address bytes in the MSB of the register, while in little endian mode, a vector load puts lower address bytes in the LSB of the vector register. – user3528438 Mar 23 '15 at 13:01
@PaulR I do not understand how it doesn't matter. If I load 16 bytes from memory (e.g. grayscale values) `141 138 145 147 144 140 147 153 154 147 149 146 155 152 147 152` then shift them right with `vshrq_n_u32`, little endian will give `138 145 147 0 140 147 153 0 147 149 146 0 152 147 152 0` and big endian will give `0 141 138 145 0 144 140 147 0 154 147 149 0 155 152 147`. What am I missing? – Antonio Mar 23 '15 at 15:42
@PaulR You are not providing whether `141` or `152` is the lower address end in memory. – user3528438 Mar 23 '15 at 15:47
@user3528438 That's the order in which my 16 bytes are stored in memory, the order you get when printing the variable like this http://stackoverflow.com/questions/13257166/print-a-m128i-variable/26012188#26012188. The lower address is where 141 is stored. I feel there's something fundamental I am not getting... – Antonio Mar 23 '15 at 15:54
Then on a little endian machine, you take the address of `141`, load a vector from there, and `141` will show up in the register as the lowest byte, which is the right-most one. If you do the same thing on a big endian machine, `141` will show up in the left-most byte. That's why I mentioned hand-writing order: when I write or print a little endian byte stream, I always keep lower-address words to the right of higher-address ones. – user3528438 Mar 23 '15 at 16:34
Sorry for the confusion - you seemed to be overly worried that all SIMD operations could be affected by endianness. I was trying to clarify that only cases like this, where you are mixing different element sizes, would be affected by endianness. You do indeed need to be aware of these particular cases, but in general you don't need to worry. I'll delete my comments shortly as I realise now that they are confusing when taken out of context. – Paul R Mar 23 '15 at 16:54
