4

Is there a convention for displaying/writing large registers, like those available in the Intel AVX instruction set?

For example, if you have 1 in the least significant byte, and 20 in the most significant byte, and 0 elsewhere in an xmm register, for a byte-wise display is the following preferred (little-endian):

[1, 0, 0, 0, ..., 0, 20]

or is this preferred:

[20, 0, 0, 0, ..., 0, 1]

Similarly, when displaying such registers as made up of larger data items, is the same rule applied? E.g., to display the register as DWORDs, I assume each DWORD is still written in the usual (big-endian) way, but what is the order of the DWORDS:

[0x1, 0x0, ..., 0x14]

vs

[0x14, 0x0, ..., 0x1]

Discussion

I think the two most promising answers are simply "LSE1 first" (i.e., the first output in the examples above) or "MSE first" (the second output). Neither depends on the endianness of the platform, as indeed once in a register data is generally endian independent (just like operations on a GP register or a long or int or whatever in C are independent of endianness). Endianness comes up in the register <-> memory interface, and here I'm asking about data already in a register.

It is possible that other answers exist, such as output that depends on endianness (and Paul R's answer may be one, but I can't tell).

LSE First

One advantage of LSE-first seems to be especially with byte-wise output: often the bytes are numbered from 0 to N, with the LSB being zero2, so LSB-first output outputs it with increasing indexes, much like you'd output an array of bytes of size N.

It's also nice on little endian architectures since the output then matches the in-memory representation of the same vector stored to memory.

MSE First

The main advantage here seems to be that the output for smaller elements is in the same order as for larger sizes (only with different grouping). For example, for a 4-byte vector in MSB notation [0x4, 0x3, 0x2, 0x1], the output for byte elements, word and dword elements would be:

[0x4, 0x3, 0x2, 0x1] [ 0x0403, 0x0201 ] [ 0x04030201 ]

Essentially, even from the byte output you can just "read off" the word or dword output, or vice-versa, since the bytes are already in the usual MSB-first order for number display. On the other hand, the corresponding output for LSE-first is:

[0x1, 0x2, 0x3, 0x4] [ 0x0201 , 0x0403 ] [ 0x04030201 ]

Note that each layer undergoes swaps relative to the row above it, so it's much harder to read off larger or smaller values. You'd need to rely more on outputting the element that is the most natural for your problem.

This format also has the advantage that on BE architectures the output then matches the in-memory representation of the same vector stored to memory3.

Intel uses MSE first in its manuals.


1 Least Significant Element

2 Such numberings are not just for documentation purposes - they are architecturally visible, e.g., in shuffle masks.

3 Of course this advantage is minuscule compared to the corresponding advantage of LSE-first on LE platforms since BE is almost dead in commodity SIMD hardware.

BeeOnRope
  • 60,350
  • 16
  • 207
  • 386
  • My personal opinion is that I prefer the little-endian representation, but I'm not aware of a standard convention, and this question seems to be rather "opinion-based". I'd imagine that many debuggers would make this a configurable option, just like the ability to switch between displaying byte-sized values, DWORD-sized values, double values, etc. – Cody Gray - on strike Dec 28 '16 at 08:00
  • My rule of thumb is: match the equivalent layout in memory, so if you have `0x1 0x2 0x3 ... 0xf` in memory, and you load it to a vector register, then displaying the contents of the vector register should also look like `0x1 0x2 0x3 ... 0xf`. – Paul R Dec 28 '16 at 09:02
  • @PaulR I'm pretty sure you'd get `0xf ... 0x3 0x2 0x1` for that memory layout :D – Margaret Bloom Dec 28 '16 at 09:09
  • @MargaretBloom: well if you use the `%v` format extensions for `printf` that are supported by some compilers (e.g. Apple's gcc and clang) then this is the behaviour that you get, and I find it helpful, as you can almost forget about the vagaries of little endianness. – Paul R Dec 28 '16 at 09:14
  • @CodyGray - it would be purely opinion based if it were "What's the best way to represent..." - but here I'm just asking if there is an existing convention, so I can follow it: a yes/no quesiton which could in principle be answered based on existing facts. Of course, opinions could differ on how much existing behavior is needed to declare it a convention, or who should get to define convention - but of course almost all questions have some degree of _judgement_ required along those lines. – BeeOnRope Dec 28 '16 at 16:29
  • @PaulR - it's not clear to me what you mean, but I commented a bit more on your answer. Perhaps what Margaret is getting at is that only on a LE architecture is it natural to display memory like `0x1 0x2 0x3` loaded into a vector as `0x1 0x2 0x3`. FWIW, Intel seems to use MSB-first in all its documentation, despite being a LE architecture! – BeeOnRope Dec 28 '16 at 16:41
  • @BeeOnRope: see further comments below answer... – Paul R Dec 28 '16 at 17:11

2 Answers2

3

Being consistent is the most important thing; If I'm working on existing code that already has LSE-first comments or variable names, I match that.

Given the choice, I prefer MSE-first notation in comments, especially when designing something with shuffles or especially packing/unpacking to different element sizes.

Intel uses MSE-first not only in their diagrams in manuals, but in the naming of intrinsics/instructions like pslldq (byte shift) and psrlw (bit-shift): a left bit/byte shift goes towards the MSB. LSE-first thinking doesn't save you from mentally reversing things, it means you have to do it when thinking about shifts instead of loads/stores. Since x86 is little-endian, you sometimes have to be thinking about this anyway.


In MSE-first thinking about vectors, just remember that memory order is right to left. When you need to think about overlapping unaligned loads from a block of memory, you can draw the memory contents in right-to-left order, so you can look at vector-length windows of it.

In a text editor, it's no problem to add new text at the left hand side of something and have the existing text displaced to the right, so adding more elements to a comment isn't a problem.

Two major downsides to MSE-first notation are:

  • harder to type the alphabet backwards (like h g f e | d c b a for an AVX vector of 32-bit elements), so I sometimes just start from the right and type a, left-arrow, b, space, ctrl-left arrow, c, space, ... or something like that.

  • Opposite from C array-initializer order. Normally not a problem, because _mm_set_epi* uses MSE-first order. (Use _mm_setr_epi* to match LSE-first comments).


An example where MSE-first is nice is when trying to design a lane-crossing version of 256b vpalignr: See my answer on that question How to concatenate two vector efficiently using AVX2?. That includes design-notes in MSE-first notation.

As another example, consider implementing a variable-count byte-shift across a whole vector. You could make a table of pshufb control vectors, but that would be a massive waste of cache footprint. Much better to load a sliding window from memory:

/*  Example of using MSE notation for memory as well as vectors

// 4-element vectors to keep the design notes compact
// I started by just writing down a couple rows of this, then noticing which way they lined up
<< 3:                       00 FF FF FF
<< 1:                 02 01 00 FF
   0:              03 02 01 00
>> 2:        FF FF 03 02
>> 3:     FF FF FF 03
>> 4:  FF FF FF FF

       FF FF FF FF 03 02 01 00 FF FF FF FF
  highest address                       lowest address
*/

#include <immintrin.h>
#include <stdint.h>
// positive counts are right shifts, negative counts are left
// a left-only or right-only implementation would only have one side of the table,
// and only need 32B alignment for the constant in memory to prevent cache-line splits.
__m128i vshift(__m128i v, intptr_t bytes_right)
{   // intptr_t means the caller has to sign-extend it to the width of a pointer, saving a movsx in the non-inline version

   // C11 uses _Alignas, C++11 uses alignas
    _Alignas(64) static const int32_t shuffles[] = { 
        -1, -1, -1, -1,
        0x03020100, 0x07060504, 0x0b0a0908, 0x0f0e0d0c,
        -1, -1, -1, -1
    };  // compact but messy with a mix of ordering :/
    const char *identity_shuffle = 16 + (const char*)shuffles;  // points to the middle 16B

    //  count &= 0xf;  tricky to efficiently limit the count while still allowing >>16 to zero the vector, and to allow negative.
    __m128i control = _mm_load_si128((const __m128i*) (identity_shuffle + bytes_right));
    return _mm_shuffle_epi8(v, control);
}

This is kind of the worst-case for MSE-first, because right-shifts take a window from farther left. In LSE-first notation, it might look more natural. Still, unless I got something backwards :P, I think it shows that you can successfully use MSE-first notation even for something you'd expect to be tricky. It didn't feel mind-bending or over-complicated. I just started writing down shuffle control vectors and then lined them up. I could have made it slightly simpler when translating to a C array if I'd used uint8_t shuffles[] = { 0xff, 0xff, ..., 0, 1, 2, ..., 0xff };. I haven't tested this, only that it compiles to one instruction:

    vpshufb xmm0, xmm0, xmmword ptr [rdi + vshift.shuffles+16]
    ret

MSE lets you notice more easily when you can use a bit-shift instead of a shuffle instruction, to reduce pressure on port 5. e.g. psllq xmm, 16/_mm_slli_epi64(v,16) to shift word elements left by one (with zeroing at qword boundaries). Or when you need to shift byte elements, but the only available shifts are 16-bit or wider. The narrowest variable-per-element shifts are 32-bit elements (vpsllvd).

MSE makes it easy to get the shuffle constant right when using larger or smaller granularity shuffles or blends, e.g. pshufd when you can keep pairs of word elements together, or pshufb to shuffle words across the whole vector (because pshuflw/hw is limited).

_MM_SHUFFLE(d,c,b,a) goes in MSE order, too. So does any other way of writing it as a single integer, like C++14 0b11'10'01'00 or 0xE4 (the identity shuffle). Using LSE-first notation will make your shuffle constants look "backwards" relative to your comments. (except for pshufb constants, which you can write with _mm_setr)

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
1

My rule of thumb is: match the equivalent layout in memory, so if you have 0x1 0x2 0x3 ... 0xf in memory, and you load it to a vector register, then displaying the contents of the vector register should also look like 0x1 0x2 0x3 ... 0xf.

If you use the %v format extensions for printf that are supported by some compilers (e.g. Apple's gcc and clang) then this is the behaviour that you get, and I find it helpful, as you can almost forget about the vagaries of little endianness, e.g.

#include <stdio.h>
#include <stdint.h>
#include <xmmintrin.h>

int main(void)
{
    uint8_t a[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };

    __m128i v = _mm_loadu_si128((__m128i *)a);

    printf("v = %#vx\n", v);
    printf("v = %#vhx\n", v);
    printf("v = %#vlx\n", v);

    return 0;
}

With a suitable compiler this gives:

v = 0x1 0x2 0x3 0x4 0x5 0x6 0x7 0x8 0x9 0xa 0xb 0xc 0xd 0xe 0xf 0x10
v = 0x201 0x403 0x605 0x807 0xa09 0xc0b 0xe0d 0x100f
v = 0x4030201 0x8070605 0xc0b0a09 0x100f0e0d
Paul R
  • 208,748
  • 37
  • 389
  • 560
  • Thanks Paul. It is not actually clear to me what you mean by "match the layout in memory". Do you mean that you would always print the least significant byte first, regardless of the endianness of the architecture, and so on a big-endian architecture you'd print `0xf 0xe ... 0x1` as a load has the reversed effect? Or do you mean that on a BE architecture, where the same load results in a reversed order (i.e., 0x1 is in the MSB now), you'd reverse the output order so it still shows up as `0x1 0x2 ...`? Your example and description can be interpreted both ways, I think. – BeeOnRope Dec 28 '16 at 16:38
  • 1
    Yes, it's confusing I know, and as someone who has worked on both BE and LE SIMD for many years it still trips me up occasionally. I think we're just talking about how to display/interpret SIMD vector contents, as per your question, i.e. how you might display registers in a debugger or for debug printf statements, or even just for documentation purposes, yes ? In which case, I would re-iterate the answer above, but perhaps qualify it by saying that I would represent the vector *elements* in the same order as the elements in memory, regardless of endianness. So the C example above... – Paul R Dec 28 '16 at 17:08
  • ...illustrates this (for a little endian architecture), and to my mind the printf extensions in gcc/clang do the right thing when displaying vectors, as far as order of elements is concerned. – Paul R Dec 28 '16 at 17:09
  • Well it's still not clear to me. You'd need to run the example above on both LE and BE architectures, or have documentation about how %v works. The documentation [I could find](http://www.manpages.info/macosx/printf.3.html) for `printf` doesn't explain how it works. How do you expect %v will display the above code on a BE arch? I believe it will display `0x10, 0xf, 0xe...` since the data in the register following the load will have the opposite order, and as far as I know all printf specifiers are endian-independent. For it to display 1, 2 it would be unusual. – BeeOnRope Dec 28 '16 at 17:32
  • Let me ask a more specific question. On a BE architecture, ignoring `printf` how would your rule of thumb result in printing out a vector loaded from memory `0x1 0x2 0x3 ... ` as in your example? I find introducing memory layout here confusing. Register contents don't have any inherent endianness so you can talk about them, including displaying them, without regard for endianness usually. All the operations generally work in the expected way across LE and BE. So by relating the display of vector registers to memory layout, it introduces endianness into it, maybe where it doesn't belong. – BeeOnRope Dec 28 '16 at 17:34
  • @BeeOnRope: it's been a while since I did any BE SIMD (RIP AltiVec) but I think it is the same as for the LE example above (apart from the ordering of bytes within elements of course), i.e. it matches memory order. In general one doesn't case about which end of the SIMD vector is least significant (and documentation tends to be inconsistent in this area anyway), so you just want something that's conceptually easy to work with. I'm sure you could make cases for other approaches, so it may just all boil down to personal preference. Maybe it would help if you explained your overall goal ? – Paul R Dec 28 '16 at 17:40
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/131732/discussion-between-beeonrope-and-paul-r). – BeeOnRope Dec 28 '16 at 17:45