4

Is there a bit twiddling hack for efficiently unpacking a 16-bit packed BCD number?

Doing it the pedestrian way requires 10 operations (3 shifts, 4 ANDs and 3 ORs or ADDs):

x = (bcd & 0xF000) << 12
  | (bcd & 0x0F00) <<  8
  | (bcd & 0x00F0) <<  4
  | (bcd & 0x000F)

With multi-way ADD/OR the critical path length would be 3 but these operations tend to be binary and so most CPUs would be looking at a critical path of length 4.

Can this be done more efficiently?

Note: for some purposes it can be equally useful if some permutation of the nibbles can be unpacked especially efficiently, like if the word to be unpacked comes from a lookup table over whose creation I have full control (so that I can stick each digit wherever I want). The purpose of using packed instead of unpacked BCD in this case would be to halve the memory pressure and to avoid exceeding the size of the L1 cache, taking some load off an over-saturated memory subsystem by increasing the load on the CPU's ALUs.

For example, if I permute the digits like 0x1324 then a simple de-interleave yields 0x01020304:

x = ((bcd << 12) | bcd) & 0x0F0F0F0F

That's just three operations with critical path length 3, quite an improvement over the original version...

DarthGizka
  • 4,347
  • 1
  • 24
  • 36
  • 4
    The interleaving is the way to go. If you have a larger source (32 bit) and target (64 bit) then you can do a 3rd interleave, but anything above that is unlikely to get you much on current 64 bit processors. Be aware of issues such as integer processor endian and alignment, though, Also check out the SSE instructions as there are some interesting number packing and unpacking instructions. – Gem Taylor Jan 09 '20 at 18:00
  • 1
    Are you targeting a processor on which [pdep](https://www.felixcloutier.com/x86/pdep) exists and is efficient? – harold Jan 09 '20 at 18:01
  • Do you need portable code, or can you target a particular ISA? – Toby Speight Jan 09 '20 at 18:02
  • @ harold: At the moment I'm targetting C language level expressions without a specific language or processor architecture in mind. Everything I get optimised at that level cannot be b0rked by even by JIT compilers like those of C# and Java (note: the 'j' in JIT is definitely *unvoiced*). And if I can use gcc/VC++ then their optimisers won't complain about having more free registers to play with and smaller clumps of operations to perform... – DarthGizka Jan 09 '20 at 18:13
  • 1
    What does this have to do with BCD? You seem to just be asking about unpacking nibbles? – Chris Dodd Jan 09 '20 at 21:26
  • @ Chris: unpacking BCD is often a step that comes before further processing, and yes, unpacking BCD is just unpacking nibbles. I mentioned the unpacking of permuted forms because it is *sometimes* a viable option and less computationally expensive than unpacking 'proper' BCD. And yes, I'm only interested in BCD; the nibbles A through F cannot occur in the use cases I'm working on. – DarthGizka Jan 09 '20 at 22:21
  • @DarthGizka Is integer multiplication allowed as a building block? – njuffa Jan 09 '20 at 23:53
  • @ njuffa: sure, why not? The proof of the pudding is in the benchmark, and pundits will in any case pick what they like best. For example, in interpreted languages like [IDC](https://www.hex-rays.com/products/ida/support/idadoc/157.shtml) or Python, all the simple arithmetic operators are essentially equally fast because their cost is dwarfed by the interpreter overhead, This can favour different solutions than in languages that get compiled to machine code. – DarthGizka Jan 10 '20 at 06:08
  • why is this called unpacking? I would expect that `0x1234` will be unpacked into `0x04D2` ... what you have is just converting the BCD digits from 4bit into 8 bit by zero padding ... not sure what is this good for ... – Spektre Jan 10 '20 at 08:41
  • 1
    @ Spektre: it is called 'unpacking' because it converts packed BCD into unpacked BCD. – DarthGizka Jan 10 '20 at 09:58

3 Answers3

4

Here is an alternative way, with fewer operations but a longer critical path, based on the binary decomposition of the move-distance of the nibbles (moving nibbles that move by 8 or 12 steps together by 8, moving nibbles that move a distance of 4 or 12 together by 4).

x = bcd
x = ((x & 0xFF00) << 8) | (x & 0xFF)
x = ((x & 0x00F000F0) << 4) | (x & 0x000F000F)

For example:

// start
0000ABCD
// move A and B by 8
00AB00CD
// move A and C by 4
0A0B0C0D
harold
  • 61,398
  • 6
  • 86
  • 164
  • This looks pretty good! However, I have a strong feeling that it wants to get a bit slimmer (for one thing, the masking seems excessive). ;-) – DarthGizka Jan 10 '20 at 10:08
  • OK, here's my interpretation. The first step (shortened to `x |= x << 8`) sets up a value that has the upper two nibbles shifted up by 8. Each half of the second step then picks one of the two lowest nibbles and one of the two shifted high nibbles, and the two intermediate results are combined after shifting one of them up by 4. This means that the separation into two statements is strictly necessary, since step 2 references the step 1 result twice, and each half of step 2 needs an unshifted nibble and a shifted one. No redundancy left; ops: 6 (down 4), path: 5 (up 1). Pretty good! – DarthGizka Jan 10 '20 at 12:28
4

The most efficient solution will be machine specific, as different ISAs have different capabilities when it comes to dealing with immediate constants, or combining shifts with ALU operations. Here is an alternative implementation with good instruction-level parallelism that may be superior on platforms with a very fast integer multiply. Integer multiply is often helpful for bit twiddling algorithms by performing multiple shift-add operations in parallel.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* reference implementation */
uint32_t bcd_spread_1 (uint32_t a)
{
    return (((a & 0xF000) << 12) |
            ((a & 0x0F00) <<  8) |
            ((a & 0x00F0) <<  4) |
            ((a & 0x000F) <<  0));
}

/* alternative implementation */
uint32_t bcd_spread_2 (uint32_t a)
{
    return ((((a & 0xf0f0) * 0x1010) & 0x0f000f00) |
            (((a & 0x0f0f) * 0x0101) & 0x000f000f));
}

/* BCD addition. Knuth TAOCP 4 */
uint32_t median (uint32_t x, uint32_t y, uint32_t z)
{
    return (x & (y | z)) | (y & z);
}

uint32_t bcd_add (uint32_t x, uint32_t y)
{
    uint32_t z, u, t;
    z = y + 0x66666666;
    u = x + z;
    t = median (~x, ~z, u) & 0x88888888;
    return u - t + (t >> 2);
}

int main (void)
{
    uint32_t x, y, bcd = 0;
    do {
        x = bcd_spread_1 (bcd);
        y = bcd_spread_2 (bcd);
        if (x != y) {
            printf ("!!!! bcd=%04x x=%08x y=%08x\n", bcd, x, y);
            return EXIT_FAILURE;
        }
        bcd = bcd_add (bcd, 1);
    } while (bcd < 0x10000);
    return EXIT_SUCCESS;
}
njuffa
  • 23,970
  • 4
  • 78
  • 130
  • Thanks for sharing this interesting technique and for providing the code for a test drive! Your solution seems to have the edge over harold's in terms of both number of operations and path length. For some purposes that would already make it the winner, regardless of what the umpires with the strange names like Ice Lake and Ryzen have to say... – DarthGizka Jan 10 '20 at 10:45
-1

Use the DoubleDabble algorithm.

user448810
  • 17,381
  • 4
  • 34
  • 59
  • 1
    That wouldn't convert 0x1234 to 0x01020304. – interjay Jan 09 '20 at 19:16
  • @ user448810: I've probably learned more from you than from anyone else here on Stack Exchange and I enjoyed many of your exquisite postings, but this one seems to be a bit too cryptic... AFAICS the DoubleDabble is for converting binary to BCD using simple bitwise serial operations (ideal for simple logic and simple CPUs) but for the problem at hand - essentially unpacking nibbles - we want very few very massive (SIMD-like) operations. – DarthGizka Jan 09 '20 at 19:19
  • Thank you for the compliment. And I erred; it is reverse DoubleDabble that converts BCD to binary. The alternative is table lookup. – user448810 Jan 09 '20 at 19:26
  • The question has nothing to do with converting BCD to or from binary. – interjay Jan 09 '20 at 19:27
  • @user448810 double-dabble converts 1234 to 0x1234, not 0x1234 to the expected value – phuclv Jan 10 '20 at 03:20