
This question: How to de-interleave bits (UnMortonizing?) has a good answer for extracting one of the two halves of a Morton number (just the odd bits), but I need a solution which extracts both parts (the odd bits and the even bits) in as few operations as possible.

For my use case I need to take a 32-bit int and extract two 16-bit ints, where one is the even bits and the other is the odd bits shifted right by 1 bit, e.g.

input,  z: 11101101 01010111 11011011 01101110

output, x: 11100001 10110111 // odd bits shifted right by 1
        y: 10111111 11011010 // even bits

There seem to be plenty of solutions using shifts and masks with magic numbers for generating Morton numbers (i.e. interleaving bits), e.g. Interleave bits by Binary Magic Numbers, but I haven't yet found anything for doing the reverse (i.e. de-interleaving).
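For reference, the forward (interleaving) direction with that magic-number method looks like this; a sketch, with function names of my own choosing:

// Spread the low 16 bits of n out to the even bit positions of a 32-bit word.
static uint32_t part1by1(uint32_t n)
{
    n &= 0x0000FFFF;
    n = (n | (n << 8)) & 0x00FF00FF;
    n = (n | (n << 4)) & 0x0F0F0F0F;
    n = (n | (n << 2)) & 0x33333333;
    n = (n | (n << 1)) & 0x55555555;
    return n;
}

// z = even bits from x, odd bits from y.
uint32_t interleave(uint16_t x, uint16_t y)
{
    return part1by1(x) | (part1by1(y) << 1);
}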

UPDATE

After re-reading the section from Hacker's Delight on perfect shuffles/unshuffles I found some useful examples which I adapted as follows:

// morton1 - extract even bits

#include <stdint.h>

uint32_t morton1(uint32_t x)
{
    x = x & 0x55555555;                 // isolate the even bits
    x = (x | (x >> 1)) & 0x33333333;    // pack them into 2-bit groups
    x = (x | (x >> 2)) & 0x0F0F0F0F;    // ... then into nibbles
    x = (x | (x >> 4)) & 0x00FF00FF;    // ... then into bytes
    x = (x | (x >> 8)) & 0x0000FFFF;    // ... then into the low 16-bit half
    return x;
}

// morton2 - extract odd and even bits

void morton2(uint32_t *x, uint32_t *y, uint32_t z)
{
    *x = morton1(z);
    *y = morton1(z >> 1);
}
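As a quick sanity check against the example at the top (note that morton2 as written returns the even bits in x and the odd bits in y, i.e. the opposite of the labelling in the example):

uint32_t z = 0xED57DB6E;  // 11101101 01010111 11011011 01101110
uint32_t x, y;
morton2(&x, &y, z);
// x == 0xBFDA (10111111 11011010, even bits)
// y == 0xE1B7 (11100001 10110111, odd bits)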

I think this can still be improved on, both in its current scalar form and also by taking advantage of SIMD, so I'm still interested in better solutions (either scalar or SIMD).

Paul R
  • The interleaving solution you link to uses twice as many operations as the de-interleaving solution you link to. If this is acceptable, you can achieve the same performance by applying the de-interleaving solution twice. I don't think you can do any better than that, since both solutions use the same principle and have stages where half of the bits are 0, so they can only process half of the information in one go; if you want all the information you need two goes. Of course you can do it in one go if you have 64-bit integers; then you can move one of the parities into the upper 32 bits. – joriki Feb 05 '11 at 20:20
  • @joriki: thanks - I was hoping that there might be a more efficient way of extracting both parts simultaneously. I'm looking at the "perfect shuffle" examples in Hacker's Delight and wondering if these are applicable (results in high and low half of output word). I'm also wondering whether I can use SIMD (SSE3/SSE4) to perhaps get a 2x improvement. – Paul R Feb 05 '11 at 20:25
  • 1
    I played around with this some more -- I didn't come up with a better solution but I did make a somewhat interesting observation: You could efficiently change AaBbCcDd.. into ABabCDcd.. if you could efficiently change 0aB00cD0.. into 0Ba00Dc0.. -- so you can reduce this step to efficiently swapping two bits, which means implementing the map 0->0, 3->3, 1->2, 2->1. The reversible operations on two bits (mod 4) that I can think of are: adding 0, 1, 2 or 3, XORing with 1 or 3, or multiplying by 3. But these only generate an 8-element subgroup of S_4 that doesn't include the required permuation. – joriki Feb 06 '11 at 09:41
  • @joriki: interesting, thanks - I'll look into this in some more detail tomorrow when I have more time. I had one other thought, which I also haven't had time to investigate yet: if you keep applying the *interleave* operation, do you eventually get back to where you started, i.e. is it cyclical? If so, can this fact be exploited to carry out the de-interleave operation? – Paul R Feb 06 '11 at 10:01
  • 1
    I assume that by "interleave operation" you mean treating the upper 16 bits of a 32-bit word as odd bits and the lower 16 bits as even bits and obtaining a new 32-bit word by interleaving them? The abstract answer is that yes, it's cyclical, since it's a bijective operation and there is a finite number of different 32-bit words :-) But more practically speaking, the cycle length is 5: The interleaving operation cycles the digits in the binary representation of the bit indices, the least significant digit becoming the most significant, and there are 5 digits to cycle through for a 32-bit word. – joriki Feb 06 '11 at 11:29
  • 2
    Another thought I had, thinking out of the box a bit: Do you need the odd and even bits in the right order? Or could you restructure the rest of your code (for instance by using different lookup tables) such that you can accept them in a different order? Because getting them in a different order is very easy: odd = x & 0xaaaaaaaa; odd = (odd | (odd >>> 17)) & 0xffff; even = x & 0x55555555; even = (even | (even >>> 15)) & 0xffff; – joriki Feb 06 '11 at 11:34
  • 1
    @joriki: unfortunately I need the bits in the right order - I'm going to be using them as indices into an array that I need to iterate through in Morton order. – Paul R Feb 06 '11 at 12:04

6 Answers


If your processor handles 64-bit ints efficiently, you could combine the operations...

uint64_t w = ((uint64_t)(z & 0xAAAAAAAA) << 31) | (z & 0x55555555);
w = (w | (w >> 1)) & 0x3333333333333333;
w = (w | (w >> 2)) & 0x0F0F0F0F0F0F0F0F;
...
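For reference, here is one way the remaining steps could be completed; a sketch (the function name morton2_64 and the final placement of x and y are my own reading of where the bits land, so verify before use):

void morton2_64(uint32_t *x, uint32_t *y, uint32_t z)
{
    // Park the odd bits in the upper half so one pass de-interleaves both.
    uint64_t w = ((uint64_t)(z & 0xAAAAAAAA) << 31) | (z & 0x55555555);
    w = (w | (w >> 1)) & 0x3333333333333333;
    w = (w | (w >> 2)) & 0x0F0F0F0F0F0F0F0F;
    w = (w | (w >> 4)) & 0x00FF00FF00FF00FF;
    w = (w | (w >> 8)) & 0x0000FFFF0000FFFF;
    *x = (uint32_t)(w & 0xFFFF);  // even bits, gathered in the low half
    *y = (uint32_t)(w >> 32);     // odd bits, gathered in the high half
}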
AShelly
  • Ah yes - that's a good idea - I will only be working with 64-bit CPUs so this is definitely worth doing. – Paul R Feb 07 '11 at 19:33

For Intel Haswell and later CPUs, you can use the BMI2 instruction set, which contains the pext and pdep instructions. These can (among other great things) be used to build your functions.

#include <immintrin.h>
#include <stdint.h>

// on GCC, compile with option -mbmi2, requires Haswell or better.

uint64_t xy_to_morton (uint32_t x, uint32_t y)
{
    return _pdep_u64(x, 0x5555555555555555) | _pdep_u64(y, 0xaaaaaaaaaaaaaaaa);
}

void morton_to_xy (uint64_t m, uint32_t *x, uint32_t *y)
{
    *x = (uint32_t)_pext_u64(m, 0x5555555555555555);
    *y = (uint32_t)_pext_u64(m, 0xaaaaaaaaaaaaaaaa);
}
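For the exact 32-bit case in the question there are also 32-bit variants of these intrinsics; a minimal sketch (the name morton2_bmi2 is mine):

// 32-bit counterpart of the question's morton2: x gets the even bits, y the odd bits.
void morton2_bmi2 (uint32_t *x, uint32_t *y, uint32_t z)
{
    *x = _pext_u32(z, 0x55555555);
    *y = _pext_u32(z, 0xaaaaaaaa);
}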
Dawid Szymański

In case someone is using Morton codes in 3D and needs to read one bit in every three of a 64-bit word, here is the function I used:

uint64_t morton3(uint64_t x) {
    x = x & 0x9249249249249249;
    x = (x | (x >> 2))  & 0x30c30c30c30c30c3;
    x = (x | (x >> 4))  & 0xf00f00f00f00f00f;
    x = (x | (x >> 8))  & 0x00ff0000ff0000ff;
    x = (x | (x >> 16)) & 0xffff00000000ffff;
    x = (x | (x >> 32)) & 0x00000000ffffffff;
    return x;
}
uint64_t bits;  /* the packed 3D Morton code */
uint64_t x = morton3(bits);
uint64_t y = morton3(bits >> 1);
uint64_t z = morton3(bits >> 2);
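If you also need the inverse (e.g. for round-trip testing), the same masks can be applied in the opposite order; a sketch of my own (the name spread3 is assumed):

// Inverse of morton3: moves bit k of the low 22 bits to bit 3k.
uint64_t spread3(uint64_t x) {
    x = x & 0x00000000003fffff;
    x = (x | (x << 32)) & 0xffff00000000ffff;
    x = (x | (x << 16)) & 0x00ff0000ff0000ff;
    x = (x | (x << 8))  & 0xf00f00f00f00f00f;
    x = (x | (x << 4))  & 0x30c30c30c30c30c3;
    x = (x | (x << 2))  & 0x9249249249249249;
    return x;
}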
ponchietto

You can extract 8 interleaved bits by multiplying like so:

uint8_t deinterleave_even(uint16_t x) {
    // Cast to uint64_t so the multiplications are done in unsigned 64-bit arithmetic.
    return (((uint64_t)(x & 0x5555) * 0xC00030000C0003 & 0x0600180060008001) * 0x0101010101010101) >> 56;
}
uint8_t deinterleave_odd(uint16_t x) {
    return (((uint64_t)(x & 0xAAAA) * 0xC00030000C0003 & 0x03000C003000C000) * 0x0101010101010101) >> 56;
}

It should be trivial to combine them for 32 bits or larger.
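A hedged sketch of one such combination for the 32-bit case in the question (the name morton2_mul is mine):

void morton2_mul(uint32_t *x, uint32_t *y, uint32_t z)
{
    // Low 16 bits of z give output bits 0-7, high 16 bits give output bits 8-15.
    *x = ((uint32_t)deinterleave_even(z >> 16) << 8) | deinterleave_even((uint16_t)z);
    *y = ((uint32_t)deinterleave_odd(z >> 16) << 8) | deinterleave_odd((uint16_t)z);
}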

patstew

I didn't want to be limited to a fixed-size integer or to write lists of similar commands with hardcoded constants, so I developed a C++11 solution which uses template metaprogramming to generate the functions and the constants. The assembly code generated with -O3 seems as tight as it can get without using BMI:

andl    $0x55555555, %eax
movl    %eax, %ecx
shrl    %ecx
orl     %eax, %ecx
andl    $0x33333333, %ecx
movl    %ecx, %eax
shrl    $2, %eax
orl     %ecx, %eax
andl    $0xF0F0F0F, %eax
movl    %eax, %ecx
shrl    $4, %ecx
orl     %eax, %ecx
movzbl  %cl, %esi
shrl    $8, %ecx
andl    $0xFF00, %ecx
orl     %ecx, %esi

TL;DR source repo and live demo.


Implementation

Basically, every step in the morton1 function works by shifting and masking with a sequence of constants which look like this:

  1. 0b0101010101010101 (alternate 1 and 0)
  2. 0b0011001100110011 (alternate 2x 1 and 0)
  3. 0b0000111100001111 (alternate 4x 1 and 0)
  4. 0b0000000011111111 (alternate 8x 1 and 0)

If we were to use D dimensions, we would have a pattern with D-1 zeros and 1 one. So to generate these it's enough to generate consecutive ones and apply some bitwise or:

/// @brief Generates 0b1...1 with @tparam n ones
template <class T, unsigned n>
using n_ones = std::integral_constant<T, (~static_cast<T>(0) >> (sizeof(T) * 8 - n))>;

/// @brief Performs `@tparam input | (@tparam input << @tparam width)` @tparam repeat times.
template <class T, T input, unsigned width, unsigned repeat>
struct lshift_add :
    public lshift_add<T, lshift_add<T, input, width, 1>::value, width, repeat - 1> {
};
/// @brief Specialization for 1 repetition, just does the shift-and-add operation.
template <class T, T input, unsigned width>
struct lshift_add<T, input, width, 1> : public std::integral_constant<T,
    (input & n_ones<T, width>::value) | (input << (width < sizeof(T) * 8 ? width : 0))> {
};

Now we can generate the constants at compile time for arbitrary dimensions with the following:

template <class T, unsigned step, unsigned dimensions = 2u>
using mask = lshift_add<T, n_ones<T, 1 << step>::value, dimensions * (1 << step), sizeof(T) * 8 / (2 << step)>;

With the same type of recursion, we can generate functions for each of the steps of the algorithm x = (x | (x >> K)) & M:

template <class T, unsigned step, unsigned dimensions>
struct deinterleave {
    static T work(T input) {
        input = deinterleave<T, step - 1, dimensions>::work(input);
        return (input | (input >> ((dimensions - 1) * (1 << (step - 1))))) & mask<T, step, dimensions>::value;
    }
};
// Omitted specialization for step 0, where there is just a bitwise and
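For completeness, a sketch of what that omitted specialization could look like (assuming the mask template above):

template <class T, unsigned dimensions>
struct deinterleave<T, 0u, dimensions> {
    static T work(T input) {
        // Step 0: just apply the initial mask, e.g. 0x55555555 for 2D and 32-bit T.
        return input & mask<T, 0u, dimensions>::value;
    }
};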

It remains to answer the question "how many steps do we need?". This depends also on the number of dimensions. In general, k steps compute 2^k - 1 output bits; the maximum number of meaningful bits for each dimension is given by z = sizeof(T) * 8 / dimensions, therefore it is enough to take 1 + log_2 z steps. The problem is now that we need this as constexpr in order to use it as a template parameter. The best way I found to work around this is to define log2 via metaprogramming:

template <unsigned arg>
struct log2 : public std::integral_constant<unsigned, log2<(arg >> 1)>::value + 1> {};
template <>
struct log2<1u> : public std::integral_constant<unsigned, 0u> {};

/// @brief Helper constexpr which returns the number of steps needed to fully interleave a type @tparam T.
template <class T, unsigned dimensions>
using num_steps = std::integral_constant<unsigned, log2<sizeof(T) * 8 / dimensions>::value + 1>;

And finally, we can perform one single call:

/// @brief Helper function which combines @see deinterleave and @see num_steps into a single call.
template <class T, unsigned dimensions>
T deinterleave_first(T n) {
    return deinterleave<T, num_steps<T, dimensions>::value - 1, dimensions>::work(n);
}
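A usage sketch, tying this back to the 32-bit case from the question (expected values taken from the example at the top):

uint32_t z = 0xED57DB6E;  // 11101101 01010111 11011011 01101110
uint32_t x = deinterleave_first<uint32_t, 2>(z);       // even bits, 0xBFDA
uint32_t y = deinterleave_first<uint32_t, 2>(z >> 1);  // odd bits,  0xE1B7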
Pietro Saccardi

If you need speed, you can use a table lookup to convert one byte at a time (a two-byte table is faster but too big). The procedure was made in the Delphi IDE but the assembler/algorithm is the same.

const
  MortonTableLookup : array[byte] of byte = ($00, $01, $10, $11, $02, ... ;

procedure DeinterleaveBits(Input: cardinal);
//In: eax
//Out: dx = EvenBits; ax = OddBits;
asm
  movzx   ecx, al                                     //Use 0th byte
  mov     dl, byte ptr[MortonTableLookup + ecx]
//
  shr     eax, 8
  movzx   ecx, ah                                     //Use 2nd byte
  mov     dh, byte ptr[MortonTableLookup + ecx]
//
  shl     edx, 16
  movzx   ecx, al                                     //Use 1st byte
  mov     dl, byte ptr[MortonTableLookup + ecx]
//
  shr     eax, 8
  movzx   ecx, ah                                     //Use 3rd byte
  mov     dh, byte ptr[MortonTableLookup + ecx]
//
  mov     ecx, edx  
  and     ecx, $F0F0F0F0
  mov     eax, ecx
  rol     eax, 12
  or      eax, ecx

  rol     edx, 4
  and     edx, $F0F0F0F0
  mov     ecx, edx
  rol     ecx, 12
  or      edx, ecx
end;
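The full 256-entry table can be generated rather than typed in; a hedged C sketch (encoding inferred from the entries shown: low nibble = even bits, high nibble = odd bits):

#include <stdint.h>

uint8_t MortonTableLookup[256];

void BuildMortonTable(void)
{
    for (int b = 0; b < 256; ++b) {
        unsigned even = 0, odd = 0;
        for (int i = 0; i < 4; ++i) {
            even |= ((b >> (2 * i)) & 1u) << i;      // bits 0,2,4,6 -> low nibble
            odd  |= ((b >> (2 * i + 1)) & 1u) << i;  // bits 1,3,5,7 -> high nibble
        }
        MortonTableLookup[b] = (uint8_t)((odd << 4) | even);
    }
}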
GJ.
  • Thanks - I'm also considering a table lookup approach, but using SSSE3's PSHUFB instruction to perform multiple 4 bit lookups in parallel - the advantage of this, in addition to parallelism, would be that it doesn't require memory accesses for the table lookup. – Paul R Feb 07 '11 at 09:29
  • It will be very nice of you to see that solution... :) – GJ. Feb 09 '11 at 14:26