8

I have a 4x4 block of bytes that I'd like to transpose using general purpose hardware. In other words, for bytes A-P, I'm looking for the most efficient (in terms of number of instructions) way to go from

A B C D
E F G H
I J K L
M N O P

to

A E I M
B F J N
C G K O
D H L P

We can assume that I have valid pointers pointing to A, E, I, and M in memory (such that reading 32-bits from A will get me the integer containing bytes ABCD).

This is not a duplicate of this question because of the restrictions on both size and data type. Each row of my matrix can fit into a 32-bit integer, and I'm looking for answers that can perform a transpose quickly using general purpose hardware, similar to the implementation of the SSE macro _MM_TRANSPOSE4_PS.

Mokosha
  • If you're trying to do this in C without writing to special-purpose vector instructions directly, I suspect the best is just doing it directly. The transpose only involves 6 swaps (12 byte reads and 12 byte writes) and they can be scheduled efficiently. – R.. GitHub STOP HELPING ICE Jul 18 '14 at 03:56
  • 3
    It would be useful to know why you need to do this - most matrix handling libs that I'm aware of rarely actually transpose a matrix, they just flag it as transposed and then access the (row,col) as (col,row) instead. No memory needs moving. – Roger Rowland Jul 18 '14 at 04:31
  • 3
    @RogerRowland When transposing a 16 *byte* matrix you may well find out that any transposition flags cost more on repeated access than the transposition itself. – Kuba hasn't forgotten Monica Jul 18 '14 at 04:38
  • Are these going to be numbers? Have you thought about using Eigen? – Fantastic Mr Fox Jul 18 '14 at 05:02
  • @Ben It is an array of *bytes*. They aren't *going* to be anything, they already *are*. – Kuba hasn't forgotten Monica Jul 18 '14 at 05:05
  • @KubaOber Yes indeed, that's why I'm curious to know more about the reasons behind the OP's question - with a large matrix, transposing can help performance via locality of reference, but as you say a very small matrix being frequently accessed via swapped indices may suffer overheads in other ways. I'd just like more background really .... – Roger Rowland Jul 18 '14 at 06:18
  • @KubaOber for many cases the flag can be a compile-time property. Assuming C++, the transpose function returns a new type called TransposedMatrix for which the element-access operators are mirrored. – flodin Jul 18 '14 at 06:45
  • @flodin Correct. This of course presupposed that such type can be used where it's being used. – Kuba hasn't forgotten Monica Jul 18 '14 at 07:03
  • 2
    @RogerRowland, one example of a useful byte transpose is transpose four pixels from RGBARGBARGBARGBA to RRRRGGGGBBBBAAAA. – Z boson Jul 18 '14 at 09:03
  • @Zboson That's a good example - I hadn't thought of pixel formats in terms of matrices. – Roger Rowland Jul 18 '14 at 09:06
  • @RogerRowland, I used it here https://stackoverflow.com/questions/22244629/efficient-way-to-convert-from-premultiplied-float-rgba-to-8-bit-rgba/22249820#22249820. Writing that answer I realized that the transpose could be done with one instruction: pshufb. – Z boson Jul 18 '14 at 09:17
  • 1
    @Zboson Cool, and you just earned another upvote for that answer :-) – Roger Rowland Jul 18 '14 at 09:18
  • Do you accept vectorized solutions? –  Jul 18 '14 at 16:36
  • @YvesDaoust I'm looking for a solution that has the same constraints as Kuba Ober's. I'm not convinced that his solution is the fastest way to do this, but the more I think about it the more I'm beginning to think it may be. :( – Mokosha Jul 19 '14 at 15:47
  • Actually, these solutions handle all the bytes separately, using 16 reads and 16 writes (or 12 in the in-place case). Portable arithmetic operations offer little for shuffling bytes within or between registers (besides shifts). –  Jul 19 '14 at 16:03

5 Answers

13

You want portability and efficiency. Well, you can't have it both ways. You said you want to do this with the fewest number of instructions. With SSSE3 it's possible to do this with only one instruction, pshufb (see below), from the x86 instruction set.

Maybe ARM Neon has something equivalent. If you want efficiency (and are sure that you need it) then learn the hardware.

The SSE equivalent of _MM_TRANSPOSE4_PS for bytes is to use _mm_shuffle_epi8 (the intrinsic for pshufb) with a mask. Define the mask outside of your main loop.

//use -mssse3 with GCC (pshufb is SSSE3) or /arch:SSE2 with MSVC
#include <stdio.h>
#include <tmmintrin.h> //SSSE3
int main() {
    char x[] = {0,1,2,3, 4,5,6,7, 8,9,10,11, 12,13,14,15};
    __m128i mask = _mm_setr_epi8(0x0,0x04,0x08,0x0c, 0x01,0x05,0x09,0x0d, 0x02,0x06,0x0a,0x0e, 0x03,0x07,0x0b,0x0f);

    __m128i v = _mm_loadu_si128((__m128i*)x);
    v = _mm_shuffle_epi8(v,mask);
    _mm_storeu_si128((__m128i*)x,v);
    for(int i=0; i<16; i++) printf("%d ", x[i]); printf("\n");
    //output: 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15
}
Z boson
  • 3
    Using `_mm_lddqu_si128()` will perform better under certain circumstances, but never perform worse. If `x[]` was 16-byte-aligned, then it could be even faster using `_mm_load_si128` and `_mm_store_si128`. – St0fF Jul 13 '16 at 22:55
  • 1
    @St0fF, thanks! I had never heard of `_mm_lddqu_si128()` until now. Interesting. – Z boson Jul 14 '16 at 08:29
8

Let me rephrase your question: you're asking for a C- or C++-only solution that is portable. Then:

void transpose(uint32_t const in[4], uint32_t out[4]) {
  // A B C D    A E I M
  // E F G H    B F J N
  // I J K L    C G K O
  // M N O P    D H L P

  out[0] = in[0] & 0xFF000000U; // A . . .
  out[1] = in[1] & 0x00FF0000U; // . F . .
  out[2] = in[2] & 0x0000FF00U; // . . K .
  out[3] = in[3] & 0x000000FFU; // . . . P

  out[1] |= (in[0] <<  8) & 0xFF000000U; // B F . .
  out[2] |= (in[0] << 16) & 0xFF000000U; // C . K .
  out[3] |= (in[0] << 24);               // D . . P

  out[0] |= (in[1] >>  8) & 0x00FF0000U; // A E . .
  out[2] |= (in[1] <<  8) & 0x00FF0000U; // C G K .
  out[3] |= (in[1] << 16) & 0x00FF0000U; // D H . P

  out[0] |= (in[2] >> 16) & 0x0000FF00U; // A E I .
  out[1] |= (in[2] >>  8) & 0x0000FF00U; // B F J .
  out[3] |= (in[2] <<  8) & 0x0000FF00U; // D H L P

  out[0] |= (in[3] >> 24);               // A E I M
  out[1] |= (in[3] >> 16) & 0x000000FFU; // B F J N
  out[2] |= (in[3] >>  8) & 0x000000FFU; // C G K O
}

I don't see how it could be answered any other way, since then you'd be depending on a particular compiler compiling it in a particular way, etc.

Of course if those manipulations themselves can be somehow simplified, it'd help. So that's the only avenue of further pursuit here. Nothing stands out so far, but then it's been a long day for me.

So far, the cost is 12 shifts, 12 ORs and 16 ANDs. If the compiler and platform are any good, it can be done in nine 32-bit registers.
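The mapping the unrolled code implements can be stated compactly: output word j holds, at byte position i (counting from the most significant byte), byte j of input word i. A loop-form restatement of that same mapping, useful as a self-check (my own sketch, not part of the original answer):

```cpp
#include <cassert>
#include <cstdint>

// Output word j gets, at byte position i (from the most significant byte),
// byte j of input word i -- the same mapping the unrolled code performs.
void transpose_words(const uint32_t in[4], uint32_t out[4]) {
    for (int j = 0; j < 4; ++j) {
        uint32_t w = 0;
        for (int i = 0; i < 4; ++i) {
            uint32_t b = (in[i] >> (8 * (3 - j))) & 0xFFu; // byte j of row i
            w |= b << (8 * (3 - i));                       // to byte i of row j
        }
        out[j] = w;
    }
}
```

A compiler will not unroll this into anything better than the hand-scheduled version above, but it makes the intended byte mapping easy to verify.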

If the compiler is very sad, or the platform doesn't have a barrel shifter, then some casting could help exploit the fact that the shifts and masks are just byte extractions:

void transpose(uint8_t const in[16], uint8_t out[16]) {
  // A B C D    A E I M
  // E F G H    B F J N
  // I J K L    C G K O
  // M N O P    D H L P

  out[0]  = in[0];  // A . . .
  out[1]  = in[4];  // A E . .
  out[2]  = in[8];  // A E I .
  out[3]  = in[12]; // A E I M
  out[4]  = in[1];  // B . . .
  out[5]  = in[5];  // B F . .
  out[6]  = in[9];  // B F J .
  out[7]  = in[13]; // B F J N
  out[8]  = in[2];  // C . . .
  out[9]  = in[6];  // C G . .
  out[10] = in[10]; // C G K .
  out[11] = in[14]; // C G K O
  out[12] = in[3];  // D . . .
  out[13] = in[7];  // D H . .
  out[14] = in[11]; // D H L .
  out[15] = in[15]; // D H L P
}

If you really want to shuffle it in-place, then the following would do.

void transpose(uint8_t m[16]) {
  std::swap(m[1], m[4]);
  std::swap(m[2], m[8]);
  std::swap(m[3], m[12]);
  std::swap(m[6], m[9]);
  std::swap(m[7], m[13]);
  std::swap(m[11], m[14]);
}

The byte-oriented versions may well produce worse code on modern platforms. Only a benchmark can tell.
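As a quick correctness check for the six-swap version (a self-contained copy for testing; the swap list itself is unchanged): each off-diagonal pair (r, c)/(c, r) with r < c appears exactly once, so after the swaps m[4*r + c] holds the original m[4*c + r].

```cpp
#include <cstdint>
#include <utility>

// Copy of the six-swap in-place transpose above, renamed for this self-check.
void transpose_inplace(uint8_t m[16]) {
    std::swap(m[1], m[4]);   // (0,1) <-> (1,0)
    std::swap(m[2], m[8]);   // (0,2) <-> (2,0)
    std::swap(m[3], m[12]);  // (0,3) <-> (3,0)
    std::swap(m[6], m[9]);   // (1,2) <-> (2,1)
    std::swap(m[7], m[13]);  // (1,3) <-> (3,1)
    std::swap(m[11], m[14]); // (2,3) <-> (3,2)
}
```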

Kuba hasn't forgotten Monica
  • +1 for teaching me a new word. I'd never heard of a [barrel shifter](http://en.wikipedia.org/wiki/Barrel_shifter) before (too stuck at the HLL stuff). – Roger Rowland Jul 18 '14 at 06:30
  • This can be done with one instruction with SSSE3: pshufb (ignoring the load and store). See my answer. – Z boson Jul 18 '14 at 07:35
  • 1
    @Zboson The question provides no platform, so to me, anything explicitly bound to one platform is outside the scope of the answer. It's still good to know, but I specifically didn't mean to invoke any platform-specific intrinsics. – Kuba hasn't forgotten Monica Jul 18 '14 at 07:38
  • 1
    @KubaOber, I understand. Though I'm not sure my answer is out of the scope of the OPs question The OP is asking for two things and he can't have them both. You answered one of them (portability) and I wanted to show that the other (efficiency) can't be attained without violating the first one (12 shifts, 12 ORs, 16 ANDs is a lot worse than one shuffle). – Z boson Jul 18 '14 at 07:47
  • The last two lines of your first code example are incorrect. You should be right shifting by 16 and 8 respectively. – Chris_F Jan 29 '21 at 20:49
1

Not sure about the speed, but these are okay.

template<typename T, std::size_t Size>
void Transpose(T (&Data)[Size][Size])
{
    for (std::size_t I = 0; I < Size; ++I)
    {
        for (std::size_t J = 0; J < I; ++J)
        {
            std::swap(Data[I][J], Data[J][I]);
        }
    }
}

template<typename T, std::size_t Size>
void Transpose(T (&Data)[Size * Size])
{
    for (std::size_t I = 0; I < Size; ++I)
    {
        for (std::size_t J = 0; J < I; ++J)
        {
            std::swap(Data[I * Size + J], Data[J * Size + I]);
        }
    }
}
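Usage, assuming the two templates above (copied here so the example is self-contained): the 2-D overload deduces Size from the array type, but the flat overload cannot, because Size * Size is a non-deduced context, so its template arguments must be spelled out.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

// Copies of the two Transpose templates, for a self-contained example.
template<typename T, std::size_t Size>
void Transpose(T (&Data)[Size][Size]) {
    for (std::size_t I = 0; I < Size; ++I)
        for (std::size_t J = 0; J < I; ++J)
            std::swap(Data[I][J], Data[J][I]);
}

template<typename T, std::size_t Size>
void Transpose(T (&Data)[Size * Size]) {  // Size is NOT deducible here
    for (std::size_t I = 0; I < Size; ++I)
        for (std::size_t J = 0; J < I; ++J)
            std::swap(Data[I * Size + J], Data[J * Size + I]);
}
```

So `Transpose(grid);` works for a `uint8_t[4][4]`, but a flat `uint8_t[16]` has to be passed as `Transpose<uint8_t, 4>(flat);`.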
Brandon
1

An efficient solution is possible on a 64-bit machine, if you accept that. First shift the four 32-bit rows left by (0,) 1, 2 and 3 bytes respectively [3 shifts]. Then mask out the unwanted bytes and combine the rows with logical ORs [12 ANDs with a constant, 12 ORs]. Finally, shift back down to 32 bits [3 shifts] and read out the 32 bits.

ABCD
EFGH
IJKL
MNOP

ABCD
 EFGH
  IJKL
   MNOP

A---
 E---
  I---
   MNOP
=======
AEIMNOP
AEIM

AB--
 -F--
  -J--
   -NOP
=======
ABFJNOP
BFJN

ABC-
 --G-
  --K-
   --OP
=======
ABCGKOP
CGKO

ABCD
 ---H
  ---L
   ---P
=======
ABCDHLP
DHLP
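A sketch of this scheme in code (my own illustration of the text above, not the answer's original): stagger the four rows across a 64-bit word one byte apart, AND each row so only the byte that lands in the wanted output column survives, OR the rows together, and shift each assembled row back down.

```cpp
#include <cstdint>

// Rows packed big-endian style: 'A' is the most significant byte of in[0].
void transpose64(const uint32_t in[4], uint32_t out[4]) {
    uint64_t r[4];
    for (int i = 0; i < 4; ++i)
        r[i] = (uint64_t)in[i] << (8 * (3 - i));  // ABCD, then EFGH one byte right, ...

    for (int j = 0; j < 4; ++j) {
        uint64_t t = 0;
        for (int i = 0; i < 4; ++i)               // byte j of row i sits at byte 6-i-j
            t |= r[i] & (0xFFull << (8 * (6 - i - j)));
        out[j] = (uint32_t)(t >> (8 * (3 - j)));  // shift the assembled row back down
    }
}
```

Unrolled with the masks arranged as in the diagrams (one operand per output can go unmasked, since its stray bytes fall outside the extracted 32 bits, and row 3 needs no stagger shift), this matches the advertised count of 3 shifts, 12 ANDs and 12 ORs, plus 3 shifts back.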
1

I posted an answer for this same problem a while back for SSE here.

The only things that need to be added are vectorized load/store operations.

This answer is similar to Z boson's answer to this question; examples for load/store can be seen there. This answer differs because, in addition to the SSSE3 implementation, there is an SSE2 implementation which is guaranteed to run on any x64 processor.

It's worth noting that both of these solutions assume that the entire matrix is row-major in memory, but OP's question states that each row could have its own pointer, which implies the array could be fragmented.

Apriori