Is there a convention for displaying/writing large registers, like those available in the Intel AVX instruction set?
For example, if you have 1 in the least significant byte, and 20 in the most significant byte, and 0 elsewhere in an xmm register, for a byte-wise display is the following preferred (little-endian):
[1, 0, 0, 0, ..., 0, 20]
or is this preferred:
[20, 0, 0, 0, ..., 0, 1]
Similarly, when displaying such registers as made up of larger data items, is the same rule applied? E.g., to display the register as DWORDs, I assume each DWORD is still written in the usual (big-endian) way, but what is the order of the DWORDs:
[0x1, 0x0, ..., 0x14000000]
vs
[0x14000000, 0x0, ..., 0x1]
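To make the two candidate orderings concrete, here's a minimal C sketch (illustration only, not a statement of any existing convention) that builds the example value with SSE2 intrinsics and prints it byte-wise both ways:

```c
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

int main(void) {
    /* 1 in the least significant byte (byte 0), 20 (0x14) in the most
       significant byte (byte 15), 0 elsewhere. _mm_setr_epi8 takes its
       arguments in element order, element 0 first. */
    __m128i v = _mm_setr_epi8(1, 0, 0, 0, 0, 0, 0, 0,
                              0, 0, 0, 0, 0, 0, 0, 20);

    uint8_t b[16];
    _mm_storeu_si128((__m128i *)b, v);  /* b[0] is byte 0 (the LSB) */

    printf("LSE first: [");             /* [1, 0, ..., 0, 20] */
    for (int i = 0; i < 16; i++)
        printf("%d%s", b[i], i < 15 ? ", " : "]\n");

    printf("MSE first: [");             /* [20, 0, ..., 0, 1] */
    for (int i = 15; i >= 0; i--)
        printf("%d%s", b[i], i > 0 ? ", " : "]\n");
}
```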
Discussion
I think the two most promising answers are simply "LSE¹ first" (i.e., the first output in the examples above) or "MSE first" (the second output). Neither depends on the endianness of the platform: once data is in a register it is generally endian-independent (just as operations on a GP register, or on a long or int or whatever in C, are independent of endianness). Endianness comes up at the register <-> memory interface, and here I'm asking about data already in a register.
It is possible that other answers exist, such as output that depends on endianness (and Paul R's answer may be one, but I can't tell).
LSE First
One advantage of LSE-first shows up especially with byte-wise output: the bytes are often numbered from 0 to N-1, with the LSB being byte zero², so LSE-first output lists them in order of increasing index, much as you'd print an array of N bytes.
It's also nice on little-endian architectures, since the output then matches the in-memory representation of the same vector stored to memory.
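As a sketch of that point (assuming a little-endian host, e.g., x86), storing the vector and dumping the bytes at increasing addresses yields exactly the LSE-first display:

```c
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>

int main(void) {
    /* Same example value: byte 0 = 1, byte 15 = 20 (0x14) */
    __m128i v = _mm_setr_epi8(1, 0, 0, 0, 0, 0, 0, 0,
                              0, 0, 0, 0, 0, 0, 0, 20);

    uint8_t mem[16];
    _mm_storeu_si128((__m128i *)mem, v);

    /* Dumping in increasing address order is already the LSE-first display */
    for (int i = 0; i < 16; i++)
        printf("%02x ", mem[i]);   /* prints: 01 00 00 ... 00 14 */
    printf("\n");
}
```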
MSE First
The main advantage here seems to be that the output for smaller elements is in the same order as for larger sizes (only with different grouping). For example, for a 4-byte vector in MSB notation [0x4, 0x3, 0x2, 0x1], the output for byte, word, and dword elements would be:
[0x4, 0x3, 0x2, 0x1]
[0x0403, 0x0201]
[0x04030201]
Essentially, even from the byte output you can just "read off" the word or dword output, or vice-versa, since the bytes are already in the usual MSB-first order for number display. On the other hand, the corresponding output for LSE-first is:
[0x1, 0x2, 0x3, 0x4]
[0x0201, 0x0403]
[0x04030201]
Note that each layer undergoes swaps relative to the row above it, so it's much harder to read off larger or smaller values. You'd need to rely more on outputting whichever element size is most natural for your problem.
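For concreteness, here's a small C sketch (host-endian-independent, since each element's value is assembled explicitly from the byte numbering, with the lower-numbered byte less significant) that prints the same 4-byte example MSE-first at each element width:

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* The 4-byte example, numbered as in the register:
       byte 0 = 0x1 (LSB) through byte 3 = 0x4 (MSB). */
    uint8_t b[4] = { 0x1, 0x2, 0x3, 0x4 };

    /* Word k is made of bytes 2k+1:2k; the dword is bytes 3:0. */
    uint16_t w0 = (uint16_t)(b[0] | b[1] << 8);
    uint16_t w1 = (uint16_t)(b[2] | b[3] << 8);
    uint32_t d0 = (uint32_t)b[0] | (uint32_t)b[1] << 8
                | (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;

    /* MSE-first: highest-numbered element printed first. The byte row
       can be read straight off as the word row and the dword row. */
    printf("[0x%x, 0x%x, 0x%x, 0x%x]\n", b[3], b[2], b[1], b[0]);
    printf("[0x%04x, 0x%04x]\n", w1, w0);
    printf("[0x%08x]\n", d0);
}
```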
This format also has the advantage that on BE architectures the output then matches the in-memory representation of the same vector stored to memory³.
Intel uses MSE first in its manuals.
¹ Least Significant Element
² Such numberings are not just for documentation purposes; they are architecturally visible, e.g., in shuffle masks.
³ Of course, this advantage is minuscule compared to the corresponding advantage of LSE-first on LE platforms, since BE is almost dead in commodity SIMD hardware.