Let me rephrase your question: you're asking for a C- or C++-only solution that is portable. Then:
void transpose(uint32_t const in[4], uint32_t out[4]) {
// A B C D A E I M
// E F G H B F J N
// I J K L C G K O
// M N O P D H L P
out[0] = in[0] & 0xFF000000U; // A . . .
out[1] = in[1] & 0x00FF0000U; // . F . .
out[2] = in[2] & 0x0000FF00U; // . . K .
out[3] = in[3] & 0x000000FFU; // . . . P
out[1] |= (in[0] << 8) & 0xFF000000U; // B F . .
out[2] |= (in[0] << 16) & 0xFF000000U; // C . K .
out[3] |= (in[0] << 24); // D . . P
out[0] |= (in[1] >> 8) & 0x00FF0000U; // A E . .
out[2] |= (in[1] << 8) & 0x00FF0000U; // C G K .
out[3] |= (in[1] << 16) & 0x00FF0000U; // D H . P
out[0] |= (in[2] >> 16) & 0x0000FF00U; // A E I .
out[1] |= (in[2] >> 8) & 0x0000FF00U; // B F J .
out[3] |= (in[2] << 8) & 0x0000FF00U; // D H L P
out[0] |= (in[3] >> 24); // A E I M
out[1] |= (in[3] >> 8) & 0x000000FFU; // B F J N
out[2] |= (in[3] << 8) & 0x000000FFU; // C G K O
}
I don't see how it could be answered any other way, since then you'd be depending on a particular compiler compiling it in a particular way, etc.
Of course if those manipulations themselves can be somehow simplified, it'd help. So that's the only avenue of further pursuit here. Nothing stands out so far, but then it's been a long day for me.
So far, the cost is 12 shifts, 12 ORs, 16 ANDs. If the compiler and platform are any good, it can be done in 9 32 bit registers.
If the compiler is very sad, or the platform doesn't have a barrel shifter, then some casting could help extol the fact that the shifts and masks are just byte extractions:
void transpose(uint8_t const in[16], uint8_t out[16]) {
// A B C D A E I M
// E F G H B F J N
// I J K L C G K O
// M N O P D H L P
out[0] = in[0]; // A . . .
out[1] = in[4]; // A E . .
out[2] = in[8]; // A E I .
out[3] = in[12]; // A E I M
out[4] = in[1]; // B . . .
out[5] = in[5]; // B F . .
out[6] = in[9]; // B F J .
out[7] = in[13]; // B F J N
out[8] = in[2]; // C . . .
out[9] = in[6]; // C G . .
out[10] = in[10]; // C G K .
out[11] = in[14]; // C G K O
out[12] = in[3]; // D . . .
out[13] = in[7]; // D H . .
out[14] = in[11]; // D H L .
out[15] = in[15]; // D H L P
}
If you really want to shuffle it in-place, then the following would do.
void transpose(uint8_t m[16]) {
std::swap(m[1], m[4]);
std::swap(m[2], m[8]);
std::swap(m[3], m[12]);
std::swap(m[6], m[9]);
std::swap(m[7], m[13]);
std::swap(m[11], m[14]);
}
The byte-oriented versions may well produce worse code on modern platforms. Only a benchmark can tell.