How to efficiently transpose a 2D bit matrix

Question

I keep stumbling over this problem (e.g. in this question). Given a 2D bit matrix/board/array in the form of an array of primitive integer types, e.g. an array of long. For simplicity, we can assume a square matrix, e.g., an array of 64 long values on platforms that have 64 bit long.

Let x[i] for 0 <= i < 64 be the input array. Compute an array y[i] for 0 <= i <= 64 such that:

(x[i] >> j) & 1 == (y[j] >> i) & 1

Here x >> i is the bitwise right-shift of x by i bits, & is bitwise and, and x[i] is the value at ith position in array x.

How to implement a function that maps array x to array y most efficiently?

Primarily I am looking for non-destructive methods, which leave the input array x intact.

Implementation language

The used programming language should have arrays and bitwise operations on integer types. Many languages fulfill these requirements. C/C++ and Java solutions will look very similar, so let's choose these languages.

This sounds like homework or a test, though very advanced. I propose you answer it yourself; your reputation says you are advanced enough. Good luck! — Paul Ogilvie, Jan 21 '17 at 10:43
@Paul I can assure you that this is not a homework question. It's a problem I encounter from time to time (see the added link). I might submit a solution myself if no one answers here. I think the problem is of general interest to deserve a Q&A on SO. — ziggystar, Jan 21 '17 at 10:44
Don't you mean `(x[i] >> j)`, so with a right shift? The `& 1` will not give an interesting result if you shift left by a non-zero amount? — Freek Wiedijk, Jan 21 '17 at 10:44
*"I might submit a solution myself if no one answers here."* No, no, no. You submit *your* solution first, then ask here if it doesn't work. Alternately, if you want others to review your solution, ask on http://codereview.stackexchange.com/, not here. — Andreas, Jan 21 '17 at 10:46
I'm voting to close this question as off-topic because [questions asking for (home)work help must include a summary of the work done so far to solve the problem, and a description of the difficulty solving it](http://stackoverflow.com/help/on-topic). — Andreas, Jan 21 '17 at 10:47
@Andres 1. I'll have to code, currently I don't have one. 2. I think the problem is of general interest as it might come up as an algorithmic challenge in every programmers live, don't you agree? Having a good solution on SO would be helpful. — ziggystar, Jan 21 '17 at 10:48
@ziggystar Then you create a question and immediately self-answer it, if that's your goal. You don't post a question that *looks like* an assignment (*"Is your implementation also applicable to the following cases?"*), expecting other to write the code for you. — Andreas, Jan 21 '17 at 10:50
And guys, please don't assume this is homework just because I know how to write good questions/specifications. This is really discouraging. — ziggystar, Jan 21 '17 at 10:50
Duplicate of [Bitwise transpose of 8 bytes](http://stackoverflow.com/q/2243985/5221149) — Andreas, Jan 21 '17 at 10:52
Ok, the duplicate links are helpful (in contrast to this stupid discussion, and the close vote because of too broad). I remember having such a discussion, and having my question moved back and forth from/to code review the last time I asked a performance question. So which tag do I have to avoid? — ziggystar, Jan 21 '17 at 10:54
@ziggystar: you have a high user level, you should know that this kind of question is not a specific coding question suitable for SO. Too broad. It would be rather for softwareengineering.stackexchange.com or cs.stackexchange.com. — Seki, Jan 21 '17 at 12:11
@seki No I don't. From my highest rated questions I'd say that at least [#1](http://stackoverflow.com/questions/8524878/implicit-conversion-vs-type-class) and [#5](http://stackoverflow.com/questions/7089347/will-using-scala-in-a-more-functional-way-scalaz-incur-a-performance-maintaina) are much broader than this question, which appears to have a quite narrow scope, and it is clear how a good answer looks like. I think the cause of this problem here is that on SO there are large differences between communities associated to different tags. At least I know that this discussion belongs to Meta. — ziggystar, Jan 21 '17 at 12:28
@ziggystar I'd vote to reopen the question, but it's really a bit too broad. So the people would probably vote to leave closed. The problem is that your side question make any possible answer terribly long. If you constrained it to some base variant, it could be a good question. I'd personally drop everything but the very base ("Additional questions" make it huge, "Destructive vs. non-destructive" - we'll see, "Implementation language" - it's fine, but the tags suffice). — maaartinus, Jan 22 '17 at 19:17
Possible duplicate of [What is the fastest way to rotate the bits in an 8x8 block on bits?](https://stackoverflow.com/questions/6930667/what-is-the-fastest-way-to-rotate-the-bits-in-an-8x8-block-on-bits) — phuclv, Aug 02 '18 at 10:37

score 9 · Accepted Answer · edited May 23 '17 at 12:24

This seems a generalization of the question Bitwise transpose of 8 bytes. That question was just about 8x8 transposition, so what you are asking is a bit different. But your question is answered just as well in section 7.3 of the book Hacker's Delight (you might be able to see the relevant pages on Google books). The code that is presented there apparently originates with Guy Steele.

The Hacker's Delight website only contains the source code from the book for the 8x8 and 32x32 cases, but the latter generalizes trivially to your 64x64 case:

#include <stdint.h>

void
transpose64(uint64_t a[64]) {
  int j, k;
  uint64_t m, t;

  for (j = 32, m = 0x00000000FFFFFFFF; j; j >>= 1, m ^= m << j) {
    for (k = 0; k < 64; k = ((k | j) + 1) & ~j) {
      t = (a[k] ^ (a[k | j] >> j)) & m;
      a[k] ^= t;
      a[k | j] ^= (t << j);
    }
  }
}

The way that this works is that the function swaps successively smaller blocks of bits, starting with 32x32 blocks (without transposing the bit within those blocks), after that within those 32x32 blocks it swaps the appropriate 16x16 blocks, etc. The variable that holds the block size is j. Therefore, the outer loop has j succcessively take the values 32, 16, 8, 4, 2 and 1, which means that the outer loop runs six times. The inner loop runs over half the lines of your of bits, the lines where a given bit in the variable k is equal to zero. When j is 32 those are the lines 0-31, when j is 16 those are the lines 0-15 and 32-47, etc. Together the inner part of the loop runs 6*32 = 192 times. What happens inside this inner part is that the mask m determines what are the bits that should be swapped, in t the xor or those bits are calculated, and that xor-ed lists of bits is used to update the bits in both places appropriately.

The book (and the website) also has a version of this code in which these loops have both been unrolled, and where the mask m is not calculated, but just assigned. I guess it depends on things like the number of registers and the size of your instruction cache whether that is an improvement?

To test that this works, suppose we define some bit pattern, say:

uint64_t logo[] = {
0b0000000000000000000000000000000000000000000100000000000000000000,
0b0000000000000000000000000000000000000000011100000000000000000000,
0b0000000000000000000000000000000000000000111110000000000000000000,
0b0000000000000000000000000000000000000001111111000000000000000000,
0b0000000000000000000000000000000000000000111111100000000000000000,
0b0000000000000000000000000000000000000000111111100000000000000000,
0b0000000000000000000000000000000000000000011111110000000000000000,
0b0000000000000000000000000000000000000000001111111000000000000000,
0b0000000000000000000000000000000000000000001111111100000000000000,
0b0000000000000000000000000000000010000000000111111100000000000000,
0b0000000000000000000000000000000011100000000011111110000000000000,
0b0000000000000000000000000000000111110000000001111111000000000000,
0b0000000000000000000000000000001111111000000001111111100000000000,
0b0000000000000000000000000000011111111100000000111111100000000000,
0b0000000000000000000000000000001111111110000000011111110000000000,
0b0000000000000000000000000000000011111111100000001111111000000000,
0b0000000000000000000000000000000001111111110000001111111100000000,
0b0000000000000000000000000000000000111111111000000111111100000000,
0b0000000000000000000000000000000000011111111100000011111110000000,
0b0000000000000000000000000000000000001111111110000001111111000000,
0b0000000000000000000000000000000000000011111111100001111111100000,
0b0000000000000000000000001100000000000001111111110000111111100000,
0b0000000000000000000000001111000000000000111111111000011111110000,
0b0000000000000000000000011111110000000000011111111100001111100000,
0b0000000000000000000000011111111100000000001111111110001111000000,
0b0000000000000000000000111111111111000000000011111111100110000000,
0b0000000000000000000000011111111111110000000001111111110000000000,
0b0000000000000000000000000111111111111100000000111111111000000000,
0b0000000000000000000000000001111111111111100000011111110000000000,
0b0000000000000000000000000000011111111111111000001111100000000000,
0b0000000000000000000000000000000111111111111110000011000000000000,
0b0000000000000000000000000000000001111111111111100000000000000000,
0b0000000000000000000000000000000000001111111111111000000000000000,
0b0000000000000000000000000000000000000011111111111100000000000000,
0b0000000000000000000111000000000000000000111111111100000000000000,
0b0000000000000000000111111110000000000000001111111000000000000000,
0b0000000000000000000111111111111100000000000011111000000000000000,
0b0000000000000000000111111111111111110000000000110000000000000000,
0b0000000000000000001111111111111111111111100000000000000000000000,
0b0000000000000000001111111111111111111111111111000000000000000000,
0b0000000000000000000000011111111111111111111111100000000000000000,
0b0000001111110000000000000001111111111111111111100000111111000000,
0b0000001111110000000000000000000011111111111111100000111111000000,
0b0000001111110000000000000000000000000111111111100000111111000000,
0b0000001111110000000000000000000000000000001111000000111111000000,
0b0000001111110000000000000000000000000000000000000000111111000000,
0b0000001111110000000000000000000000000000000000000000111111000000,
0b0000001111110000001111111111111111111111111111000000111111000000,
0b0000001111110000001111111111111111111111111111000000111111000000,
0b0000001111110000001111111111111111111111111111000000111111000000,
0b0000001111110000001111111111111111111111111111000000111111000000,
0b0000001111110000001111111111111111111111111111000000111111000000,
0b0000001111110000001111111111111111111111111111000000111111000000,
0b0000001111110000000000000000000000000000000000000000111111000000,
0b0000001111110000000000000000000000000000000000000000111111000000,
0b0000001111110000000000000000000000000000000000000000111111000000,
0b0000001111110000000000000000000000000000000000000000111111000000,
0b0000001111110000000000000000000000000000000000000000111111000000,
0b0000001111111111111111111111111111111111111111111111111111000000,
0b0000001111111111111111111111111111111111111111111111111111000000,
0b0000001111111111111111111111111111111111111111111111111111000000,
0b0000001111111111111111111111111111111111111111111111111111000000,
0b0000001111111111111111111111111111111111111111111111111111000000,
0b0000001111111111111111111111111111111111111111111111111111000000,
};

We then call the transpose32 function and print the resulting bit pattern:

#include <stdio.h>

void
printbits(uint64_t a[64]) {
  int i, j;

  for (i = 0; i < 64; i++) {
    for (j = 63; j >= 0; j--)
      printf("%c", (a[i] >> j) & 1 ? '1' : '0');
    printf("\n");
  }
}

int
main() {
  transpose64(logo);
  printbits(logo);
  return 0;
}

And this then gives as output:

0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000011111111111111111111111
0000000000000000000000000000000000000000011111111111111111111111
0000000000000000000000000000000000000000011111111111111111111111
0000000000000000000000000000000000000000011111111111111111111111
0000000000000000000000000000000000000000011111111111111111111111
0000000000000000000000000000000000000000011111111111111111111111
0000000000000000000000000000000000000000000000000000000000111111
0000000000000000000000000000000000000000000000000000000000111111
0000000000000000000000000000000000000000000000000000000000111111
0000000000000000000000000000000000000000000000000000000000111111
0000000000000000000000000000000000000000000000000000000000111111
0000000000000000000000000000000000000000000000000000000000111111
0000000000000000000000000000000000000011000000011111100000111111
0000000000000000000000000000000000111111000000011111100000111111
0000000000000000000000000000000000111111000000011111100000111111
0000000000000000000000000000000000111111000000011111100000111111
0000000000000000000000000100000000011111000000011111100000111111
0000000000000000000000011110000000011111100000011111100000111111
0000000000000000000001111110000000011111100000011111100000111111
0000000000000000000001111111000000011111100000011111100000111111
0000000000000000000000111111000000011111100000011111100000111111
0000000000000000000000111111100000001111110000011111100000111111
0000000000000000000000011111100000001111110000011111100000111111
0000000000000100000000011111110000001111110000011111100000111111
0000000000001110000000001111110000001111110000011111100000111111
0000000000011110000000001111111000001111110000011111100000111111
0000000001111111000000000111111000000111111000011111100000111111
0000000000111111100000000111111100000111111000011111100000111111
0000000000111111110000000011111100000111111000011111100000111111
0000000000011111111000000011111100000111111000011111100000111111
0000000000001111111100000001111110000011111000011111100000111111
0000000000000111111100000001111110000011111100011111100000111111
0000000000000011111110000000111111000011111100011111100000111111
0001000000000001111111000000111111000011111100011111100000111111
0011110000000001111111100000111111100011111100011111100000111111
0111111000000000111111110000011111100001111100011111100000111111
0111111110000000011111111000011111110001111110011111100000111111
1111111111000000001111111000001111110001111110011111100000111111
0011111111100000000111111100001111111001111110011111100000111111
0001111111111000000011111110000111111001111110011111100000111111
0000111111111100000011111111000111111100111100000000000000111111
0000001111111110000001111111100011111100000000000000000000111111
0000000111111111100000111111110011111000000000000000000000111111
0000000011111111110000011111110001100000000000000000000000111111
0000000000111111111000001111111000000000000000000000000000111111
0000000000011111111110000111111000000000000000000000000000111111
0000000000001111111111000111110000000000011111111111111111111111
0000000000000011111111100011100000000000011111111111111111111111
0000000000000001111111111001000000000000011111111111111111111111
0000000000000000111111111100000000000000011111111111111111111111
0000000000000000001111111100000000000000011111111111111111111111
0000000000000000000111111000000000000000011111111111111111111111
0000000000000000000011110000000000000000000000000000000000000000
0000000000000000000000100000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000

Which is nicely flipped, as we hoped for.

Edit:

This is actually not really what you asked for, as you asked for a non-destructive version of this code. You can get this by having the first swap of the 32x32 blocks go from x to y. For instance, you might do something like:

void
non_destructive_transpose64(uint64_t x[64], uint64_t y[64]) {
  int j, k;
  uint64_t m, t;

  for (k = 0; k < 64; k += 2) {
    ((uint32_t *) y)[k] = ((uint32_t *) x)[k ^ 64 + 1];
    ((uint32_t *) y)[k + 1] = ((uint32_t *) x)[k + 1];
  }
  for (; k < 128; k += 2) {
    ((uint32_t *) y)[k] = ((uint32_t *) x)[k];
    ((uint32_t *) y)[k + 1] = ((uint32_t *) x)[k ^ 64];
  }
  for (j = 16, m = 0x0000FFFF0000FFFF; j; j >>= 1, m ^= m << j) {
    for (k = 0; k < 64; k = ((k | j) + 1) & ~j) {
      t = (y[k] ^ (y[k | j] >> j)) & m;
      y[k] ^= t;
      y[k | j] ^= (t << j);
    }
  }
}

Unlike the other version of the code this does not work regardless of endianness of the architecture. Also, I know that the C standard does not allow you to access an array of uint64_t as an array of uint32_t. However, I like it that no shifts or xors are needed for the first iteration of the move-the-blocks-around loop when you do it like this.

Presenting an implementation would be better, though this is indeed helpful. — ziggystar, Jan 21 '17 at 10:52
Don't answer with link to another answer. Vote to close question as duplicate of other question. — Andreas, Jan 21 '17 at 10:53
@ziggystar: did you give me the downvote? Anyway, the [this source code](http://www.hackersdelight.org/hdcodetxt/transpose8.c.txt) link contains the implementation you were asking for. As I did not write that code, I did not want to put it here too. — Freek Wiedijk, Jan 21 '17 at 10:54
@Andreas: did you give me the downvote? Anyway, I'm new here, I don't know how to do that, please point me to documentation on how that works. And this is not _really_ a duplicate question, as the other question was about 8x8, and this one was about 64x64. — Freek Wiedijk, Jan 21 '17 at 10:56
"*did you give me the downvote*" it's not you getting the downvote, but the answer. I feel this is a significant difference. — alk, Jan 21 '17 at 11:00
I didn't downvote. And indeed the 8x8 is not easy generalizable. — ziggystar, Jan 21 '17 at 11:00
@ziggystar: but the source I pointed you to gives the generalization as well, I think. So did my answer give you what you were looking for, or could I still have given a better (for you) answer? — Freek Wiedijk, Jan 22 '17 at 23:44
I cannot look at the pages in this google book. I think they randomly remove different pages for different users. — ziggystar, Jan 23 '17 at 10:21
I see, I downvoted you, but this was just a mistake (just a bad click, no intent at all). Sorry, I can't fix it unless you edit your answer (I guess, any edit would do, but adding some relevant information would be better). — maaartinus, Jan 24 '17 at 10:44
@ziggystar: I added some source code now, plus some explanation (as you could not read those pages from the book). Better this way? — Freek Wiedijk, Jan 25 '17 at 14:45
@maaartinus: I edited my answer, adding some relevant information too. Thanks for the offer to take off the downvote! :-) — Freek Wiedijk, Jan 25 '17 at 14:46
@FreekWiedijk I didn't have much time to read your answer, but it's worth an upvote now. — maaartinus, Jan 25 '17 at 15:49

score 0 · Answer 2 · answered Nov 08 '19 at 15:20

In C++ 8x8 matrix would be like this but you can easily change it to make it more general (not just 8x8). Also, I have included main with 1 test vector just to get a feeling:

#include <iostream>
#include <string>
#include <vector>

std::vector<long> rotate(std::vector<long>& v) {
    std::vector<long> temp = { 0,0,0,0,0,0,0,0 };
    for (unsigned int i = 0; i<8; i++) {
        int number = v[i];
        for (unsigned int j = 0; j<8; j++) {
            int z = (number & (1 << (7-j)));
            if (z != 0) {
                temp[j] |= (1 << (7 - i));
            }
        }
    }
    return temp;
}



int main()
{
    std::vector<long> v = { 0, 1, 2, 3, 4, 5, 6, 7 };
    std::vector<long> rotated = rotate(v);
    for (unsigned int i = 0; i<8; i++) {
        std::cout << rotated.at(i) << " ";
    }
    return 0;
}

So, if you need it for java you can easily translate it since Java provides bit operators, too.

How to efficiently transpose a 2D bit matrix

Implementation language

2 Answers2

Edit:

Linked