
I'm writing a routine to convert between BCD (4 bits per decimal digit) and Densely Packed Decimal (DPD) (10 bits per 3 decimal digits). DPD is further documented (with the suggestion for software to use lookup-tables) on Mike Cowlishaw's web site.


This routine only ever requires the lower 16 bits of the registers it uses, yet for shorter instruction encodings I have used 32 bit instructions wherever possible. Is there a speed penalty associated with code like:

mov data,%eax # high 16 bit of data are cleared
...
shl %al
shr %eax

or

and $0x888,%edi         #   = 0000 a000 e000 i000
imul $0x0490,%di        #   = aei0 0000 0000 0000

where the alternative to a 16 bit imul would be either a 32 bit imul followed by an and, or a series of lea instructions and a final and.
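
For concreteness, a rough (untested) sketch of the 32 bit imul plus and variant:

        and  $0x888,%edi        #   = 0000 a000 e000 i000
        imul $0x0490,%edi       # low 16 bits as with the 16 bit imul, garbage above bit 15
        and  $0xe000,%edi       # discard the garbage; aei stays in bits 15..13
        # the later shr $13,%edi then yields the same index as before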

The whole code of my routine can be found below. Is there anything in it where performance is worse than it could be due to my mixing word and dword instructions?

        .section .text
        .type bcd2dpd_mul,@function
        .globl bcd2dpd_mul

        # convert BCD to DPD with multiplication tricks
        # input abcd efgh iklm in edi
        .align 8
bcd2dpd_mul:
        mov %edi,%eax           #   = 0000 abcd efgh iklm
        shl %al                 #   = 0000 abcd fghi klm0
        shr %eax                #   = 0000 0abc dfgh iklm
        test $0x880,%edi        # fast path for a = e = 0
        jz 1f

        and $0x888,%edi         #   = 0000 a000 e000 i000
        imul $0x0490,%di        #   = aei0 0000 0000 0000
        mov %eax,%esi
        and $0x66,%esi          # q = 0000 0000 0fg0 0kl0
        shr $13,%edi            # u = 0000 0000 0000 0aei
        imul tab-8(,%rdi,4),%si # v = q * tab[u-2][0]
        and $0x397,%eax         # r = 0000 00bc d00h 0klm
        xor %esi,%eax           # w = r ^ v
        or tab-6(,%rdi,4),%ax   # x = w | tab[u-2][1]
        and $0x3ff,%eax         #   = 0000 00xx xxxx xxxx
1:      ret

        .size bcd2dpd_mul,.-bcd2dpd_mul

        .section .rodata
        .align 4
tab:
        .short 0x0011 ; .short 0x000a
        .short 0x0000 ; .short 0x004e
        .short 0x0081 ; .short 0x000c
        .short 0x0008 ; .short 0x002e
        .short 0x0081 ; .short 0x000e
        .short 0x0000 ; .short 0x006e
        .size tab,.-tab
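
For reference, a hypothetical call site (assuming the System V AMD64 calling convention, which matches the register usage above):

        # hypothetical caller, not part of the routine
        mov  $0x0928,%edi       # BCD digits 9, 2, 8 in the low 12 bits
        call bcd2dpd_mul
        # %eax now holds the corresponding 10 bit DPD declet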

Improved Code

After applying some suggestions from the answer and comments and some other trickery, here is my improved code.

        .section .text
        .type bcd2dpd_mul,@function
        .globl bcd2dpd_mul

        # convert BCD to DPD with multiplication tricks
        # input abcd efgh iklm in edi
        .align 8
bcd2dpd_mul:
        mov %edi,%eax           #   = 0000 abcd efgh iklm
        shl %al                 #   = 0000 abcd fghi klm0
        shr %eax                #   = 0000 0abc dfgh iklm
        test $0x880,%edi        # fast path for a = e = 0
        jnz 1f
        ret

        .align 8
1:      and $0x888,%edi         #   = 0000 a000 e000 i000
        imul $0x49,%edi         #   = 0ae0 aei0 ei00 i000
        mov %eax,%esi
        and $0x66,%esi          # q = 0000 0000 0fg0 0kl0
        shr $8,%edi             #   = 0000 0000 0ae0 aei0
        and $0xe,%edi           #   = 0000 0000 0000 aei0
        movzwl lookup-4(%rdi),%edx
        movzbl %dl,%edi
        imul %edi,%esi          # v = q * tab[u-2][0]
        and $0x397,%eax         # r = 0000 00bc d00h 0klm
        xor %esi,%eax           # w = r ^ v
        or %dh,%al              #   = w | tab[u-2][1]
        and $0x3ff,%eax         #   = 0000 00xx xxxx xxxx
        ret

        .size bcd2dpd_mul,.-bcd2dpd_mul

        .section .rodata
        .align 4
lookup:
        .byte 0x11
        .byte 0x0a
        .byte 0x00
        .byte 0x4e
        .byte 0x81
        .byte 0x0c
        .byte 0x08
        .byte 0x2e
        .byte 0x81
        .byte 0x0e
        .byte 0x00
        .byte 0x6e
        .size lookup,.-lookup
fuz
  • Please give an explanation when downvoting. How am I supposed to find out what I did wrong without an explanation? – fuz Dec 05 '15 at 23:14
  • That 16bit `imul` will take a 2 clock penalty on most µarchs, more on Core2. My suggestion is a 32bit `imul` and a `movzx`. – harold Dec 05 '15 at 23:52
  • @harold Is a 32 bit imul + and as fast as a 32 bit imul + movzx? Because with an extra and, I could avoid the two SIB operands by left shifting only by 11. – fuz Dec 05 '15 at 23:59
  • Yes, I only suggested movzx because it would have been shorter in isolation; unlike movzx out of an 8bit reg, it has no serious advantages. – harold Dec 06 '15 at 00:04
  • It wasn't me, but I can think that the downvote was because you didn't benchmark it yourself :P – Jester Dec 06 '15 at 01:39

2 Answers


TYVM for commenting the code clearly and well, BTW. It made it super easy to figure out what was going on, and where the bits were going. I'd never heard of DPD before, so puzzling it out from uncommented code and the Wikipedia article would have sucked.


The relevant gotchas are:

  • Avoid 16bit operand size for instructions with immediate constants, on Intel CPUs. (LCP stalls)
  • Avoid reading the full 32 or 64bit register after writing only the low 8 or 16, on Intel pre-IvyBridge (partial-register extra uop). (IvB still has that slowdown if you modify an upper8 reg like AH, but Haswell removes that too.) It's not just an extra uop: the penalty on Core2 is 2 to 3 cycles, according to Agner Fog. I might be measuring it wrong, but it seems a lot less bad on SnB.
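
Both show up in the code in question, for example:

        # gotcha 1: operand-size prefix plus a 16 bit immediate = length-changing prefix
        imul $0x0490,%di        # LCP stall while decoding on Intel

        # gotcha 2: write only the low 8, then read the full register
        shl %al
        shr %eax                # partial-register merge penalty on pre-IvB Intel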

See http://agner.org/optimize/ for full details.

Other than that, there's no general problem with mixing in some instructions using the operand-size prefix to make them 16-bit.


You should maybe write this as inline asm, rather than as a called function. You only use a couple registers, and the fast-path case is very few instructions.


I had a look at the code. I didn't look into achieving the same result with significantly different logic, just at optimizing the logic you do have.


Possible code suggestions: Switch the branching so the fast-path has the not-taken branch. Actually, it might make no diff either way in this case, or might improve the alignment of the slow-path code.

.p2align 4,,10   # align to 16, unless we're already in the first 6 bytes of a block of 16
bcd2dpd_mul:
        mov %edi,%eax           #   = 0000 abcd efgh iklm
        shl %al                 #   = 0000 abcd fghi klm0
        shr %eax                #   = 0000 0abc dfgh iklm
        test $0x880,%edi        # fast path for a = e = 0
        jnz .Lslow_path
        ret

.p2align 4    # Maybe fine-tune this alignment based on how the rest of the code assembles.    
.Lslow_path:

        ...
        ret

It's sometimes better to duplicate return instructions than to absolutely minimize code-size. The compare-and-branch in this case is the 4th uop of the function, though, so a taken branch wouldn't have prevented 4 uops from issuing in the first clock cycle, and a correctly-predicted branch would still issue the return on the 2nd clock cycle.


You should use a 32bit imul for the one with the table source. (see next section about aligning the table so reading an extra 2B is ok). 32bit imul is one uop instead of two on Intel SnB-family microarches. The result in the low16 should be the same, since the sign bit can't be set. The upper16 gets zeroed by the final and before ret, and doesn't get used in any way where garbage in the upper16 matters while it's there.
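
Concretely, that just means dropping the operand-size prefix on the table operands (the same forms the BMI2 version further down uses):

        imul tab-8(,%rdi,4),%esi   # 32bit: one uop on SnB-family; upper16 garbage is masked later
        or   tab-6(,%rdi,4),%eax   # 32bit or: reads 2B past the entry, hence the alignment note below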

Your imul with an immediate operand is problematic, though.

It causes an LCP stall when decoding on Intel, and it writes the low16 of a register that is later read at full width. Its upper16 would be a problem if not masked off (since it's used as a table index). Its operands are large enough that they will put garbage into the upper16, so it does need to be discarded.

I thought your way of doing it would be optimal for some architectures, but it turns out imul r16,r16,imm16 itself is slower than imul r32,r32,imm32 on every architecture except VIA Nano, AMD K7 (where it's faster than imul32), and Intel P6 (where using it from 32bit / 64bit mode will LCP-stall, and where partial-reg slowdowns are a problem).

On Intel SnB-family CPUs, where imul r16,r16,imm16 is two uops, imul32/movzx would be strictly better, with no downside except code size. On P6-family CPUs (i.e. PPro to Nehalem), imul r16,r16,imm16 is one uop, but those CPUs don't have a uop cache, so the LCP stall is probably critical (except maybe Nehalem calling this in a tight loop, fitting in the 28 uop loop buffer). And for those CPUs, the explicit movzx is probably better from the perspective of the partial-reg stall. Agner Fog says something about there being an extra cycle while the CPU inserts the merging uop, which might mean a cycle where that extra uop is issued alone.

On AMD K8-Steamroller, imul imm16 is 2 m-ops instead of 1 for imul imm32, so imul32/movzx is about equal to imul16 there. They don't suffer from LCP stalls, or from partial-reg problems.

On Intel Silvermont, imul imm16 is 2 uops (with one per 4 clocks throughput), vs. imul imm32 being 1 uop (with one per 1 clock throughput). Same thing on Atom (the in-order predecessor to Silvermont): imul16 is an extra uop and much slower. On most other microarchitectures, throughput isn't worse, just latency.

So if you're willing to increase the code-size in bytes where it will give a speedup, you should use a 32bit imul and a movzwl %di, %edi. On some architectures, this will be about the same speed as the imul imm16, while on others it will be much faster. It might be slightly worse on AMD bulldozer-family, which isn't very good at using both integer execution units at once, apparently, so a 2 m-op instruction for EX1 might be better than two 1 m-op instructions where one of them is still an EX1-only instruction. Benchmark this if you care.
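
Applied to the first imul in the question, the replacement would look like this (untested):

        and    $0x888,%edi      #   = 0000 a000 e000 i000
        imul   $0x0490,%edi     # 32bit imul: no LCP stall; low16 of the product is unchanged
        movzwl %di,%edi         # discard the garbage above bit 15
        shr    $13,%edi         # u = 0000 0000 0000 0aei, safe to use as a table index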


Align tab to at least a 32B boundary, so your 32bit imul and or can do a 4B load from any 2B-aligned entry in it without crossing a cache-line boundary. Unaligned accesses have no penalty on all recent CPUs (Nehalem and later, and recent AMD), as long as they don't span two cache lines.
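
Concretely, the only change is the alignment directive (sketch, reusing the question's table values):

        .section .rodata
        .p2align 5              # 32B boundary: a 4B load from any 2B-aligned entry stays in one line
tab:
        .short 0x0011 ; .short 0x000a
        .short 0x0000 ; .short 0x004e
        .short 0x0081 ; .short 0x000c
        .short 0x0008 ; .short 0x002e
        .short 0x0081 ; .short 0x000e
        .short 0x0000 ; .short 0x006e
        .size tab,.-tab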

Making the operations that read from the table 32bit avoids the partial-register penalty that Intel CPUs have. AMD CPUs, and Silvermont, don't track partial-registers separately, so even instructions that write-only to the low16 have to wait for the result in the rest of the reg. This stops 16bit insns from breaking dependency chains. Intel P6 and SnB microarch families track partial regs. Haswell does full dual bookkeeping or something, because there's no penalty when merging is needed, like after you shift al, then shift eax. SnB will insert an extra uop there, and there may be a penalty of a cycle or two while it does this. I'm not sure, and haven't tested. However, I don't see a nice way to avoid this.

The shl %al could be replaced with an add %al, %al. That can run on more ports. Probably no difference, since port0/5 (or port0/6 on Haswell and later) probably aren't saturated. They have the same effect on the bits, but set flags differently. Otherwise they could be decoded to the same uop.
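
i.e. (same result bits, different flags):

        mov %edi,%eax           #   = 0000 abcd efgh iklm
        add %al,%al             # same effect on the bits as shl %al
        shr %eax                #   = 0000 0abc dfgh iklm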


changes: split the pext/pdep / vectorize version into a separate answer, partly so it can have its own comment thread.

Peter Cordes
  • Hm... both appear in my code, one cannot be removed (the `shl %al; shr %eax` bit). Is the penalty for an LCP stall higher than the time an extra `and` takes? – fuz Dec 05 '15 at 23:39
  • @FUZxxl: I'll take a look at the code tomorrow; too sleepy ATM. I just answered the written questions, so far, without looking at the code. What CPU are you asking about? Things are different for every microarch. Also, LCP stalls only happen on decoding, not when uops are coming from the uop cache, so it also matters whether this is being used in a tight loop, or called at intervals between lots of other code. – Peter Cordes Dec 06 '15 at 03:14
  • @FUZxxl: Haven't looked at the code yet, but if you're interested in running fast on recent CPUs, you could make a version for BMI2 (Intel Haswell and later). `pextr/pdep` (parallel extract / deposit bits) might be useful here. AMD Piledriver/Steamroller support BMI1 and AMD TBM, but IDK which specific instructions will prove useful. – Peter Cordes Dec 06 '15 at 12:11
  • @FUZxxl: I found some improvements. Other than BMI2, the main suggestion is using 32bit imul immediate/movzx for the first imul, and then just using 32bit `imul/or` for the table lookups (and letting the garbage in the upper16 get masked off). I think I'm done making edits for now, so I'll just wait to hear back from you on whether any of this helped. :P – Peter Cordes Dec 06 '15 at 14:57
  • WIth `pext` it's possible and quite easy to get right! See updated question for updated code. I'm going to benchmark a `pext` based version soon. – fuz Dec 06 '15 at 16:05
  • @FUZxxl: re: `mov tab-4(%rdi), %dx / movzbl %dl, %edx` what the crap? `movzx` from memory doesn't need an ALU port at all on Intel, so that should be `movzbl tab-4(%rdi), %edx`. But are you sure you can't just run `imul tab-4(%rdi), %esi`? The high16 of the table src operand can't affect the low16 of the result, and those bits eventually get masked off. Or pad your table to 4B if I'm missing something. It's tiny enough for that to be worth it. (If you keep the `movzx`-load version, make the `imul` arg a table of bytes, and only keep the zero-padded table of shorts for the `or`.) – Peter Cordes Dec 06 '15 at 16:16
  • Oh, now I got to the `or` and I see what you're doing. You're trying to save load-port uops at the expense of more ALU uops. You're probably bottlenecking on ALU or front-end, not the load-port. AMD, and Intel since SnB, can execute two loads per clock. Causing a partial-register slowdown by doing `or %dh, %al` is not at all worth it, if you care about performance on Intel pre-IvB. Small perf loss, but I'm pretty sure it'll be a loss compared to using another memory operand. – Peter Cordes Dec 06 '15 at 16:18
  • I could. So that's what you meant... This approach doesn't work any more though as I've reduced the table to one byte per entry and the second least-significant byte could cause the result to change, but let me test first. – fuz Dec 06 '15 at 16:20
  • @FUZxxl: IDK if you saw my last comment, since you were writing your own at the same time. Are you optimizing for size, or for speed? There is a tradeoff here, and you keep going for saving a few bytes even when it will cost speed, esp on pre-IvB Intel CPUs where modifying `%al` before using `%eax` has a penalty. You could reorder your last two insns, though: `and $0x3ff, %eax` / `or %dh, %al`, then possibly the caller doesn't touch `%eax` right away and avoids the stall. (Prob. not helpful, but doesn't hurt anything.) – Peter Cordes Dec 07 '15 at 14:09
  • I'm optimizing for speed with an eye on code size. Speed-wise, I could probably use an 8192 byte lookup table for the 2¹² possible inputs, but that's going to trash the L1 cache. Implementing this routine is also interesting to me purely from an educational POV, I'd like to learn something about how to write fast assembly and so far your answer has been incredibly helpful. For the stall, that's a good idea as well. – fuz Dec 07 '15 at 16:16
  • @FUZxxl: I'd go back to a table of 16b words for the imul lookup. You could use a separate table of bytes for the `or` table, instead of interleaving the imul/or operands. Just align it so all of both tables are in the same 64B cache line. Two separate memory operands will lead to fewer fused-domain uops, which is what counts from a uop-cache perspective. Since the loads micro-fuse into the ALU uops, they're "free" from a uop-cache consumption perspective. Doing a 16bit load with `mov`, then unpacking with `movzx`, is two extra fused-domain uops. – Peter Cordes Dec 07 '15 at 17:01
  • I found numbers for the penalty when pre-IvB has to insert an extra uop to merge partial registers: Agner Fog ([in a 2008 post about extending xmm to ymm for AVX, where he talks about the similarity with integer partial-reg stuff](https://software.intel.com/en-us/forums/intel-isa-extensions/topic/301853)) says it's a 2 to 3 cycle stall. I assume that means in the front-end, like 2 cycles with no uops issued, and one cycle with that single uop issued. So that's missing out on 15 fused-domain uops issuing, in those 3 cycles. Execution port or latency bottlenecks *could* still dominate... – Peter Cordes Dec 07 '15 at 17:04

(I split the BMI2 version into a separate answer, since it could end up totally different)


After seeing what you're doing with that imul/shr to get a table index, I can see where you could use BMI2 pext to replace and/imul/shr, or BMI1 bextr to replace just the shr (allowing use of imul32, instead of imul16, since you'd just extract the bits you want, rather than needing to shift zeros from the upper16). There are AMD CPUs with BMI1, but even Steamroller lacks BMI2. Intel introduced BMI1 and BMI2 at the same time with Haswell.

You could maybe process two or four 16bit words at once, with 64bit pext. But not for the whole algorithm: you can't do 4 parallel table lookups. (AVX2 VPGATHERDD is not worth using here.) Actually, you can use pshufb to implement a LUT with indices of up to 4 bits; see below.
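
As a minimal illustration of the pshufb-as-LUT idea (the register choices and the lut16 label are placeholders, not part of the code above):

        # each byte of %xmm0 holds a 4-bit index with bit 7 clear
        vmovdqa lut16(%rip),%xmm2        # hypothetical 16-entry byte table, 16B-aligned
        vpshufb %xmm0,%xmm2,%xmm1        # xmm1[i] = lut16[ xmm0[i] & 0x0f ]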

Minor improvement version:

.section .rodata
  # This won't assemble as written; the binary digits are spaced so humans can line them up with the comments.

extmask_lobits:     .long           0b0000 0111 0111 0111
extmask_hibits:     .long           0b0000 1000 1000 1000

# pext doesn't have an immediate-operand form, but it can take the mask from a memory operand.
# Load these into regs if running in a tight loop.

#### TOTALLY UNTESTED #####
.text
.p2align 4,,10
bcd2dpd_bmi2:
#       mov   %edi,%eax           #   = 0000 abcd efgh iklm
#       shl   %al                 #   = 0000 abcd fghi klm0
#       shr   %eax                #   = 0000 0abc dfgh iklm

        pext  extmask_lobits, %edi, %eax
                                #   = 0000 0abc dfgh iklm
        mov   %eax, %esi        # insn scheduling for 4-issue front-end: Fast-path is 4 fused-domain uops
          # And doesn't waste issue capacity when we're taking the slow path.  CPUs with mov-elimination won't waste execution units from issuing an extra mov
        test  $0x880, %edi        # fast path for a = e = 0
        jnz .Lslow_path
        ret

.p2align 4
.Lslow_path:
        # 8 uops, including the `ret`: can issue in 2 clocks.

        # replaces and/imul/shr
        pext  extmask_hibits, %edi, %edi #u= 0000 0000 0000 0aei
        and   $0x66, %esi                # q = 0000 0000 0fg0 0kl0
        imul  tab-8(,%rdi,4), %esi       # v = q * tab[u-2][0]
        and   $0x397, %eax               # r = 0000 00bc d00h 0klm
        xor   %esi, %eax                 # w = r ^ v
        or    tab-6(,%rdi,4), %eax       # x = w | tab[u-2][1]
        and   $0x3ff, %eax               #   = 0000 00xx xxxx xxxx
        ret

Of course, if making this an inline-asm, rather than a stand-alone function, you'd change back to the fast path branching to the end, and the slow-path falling through. And you wouldn't waste space with alignment padding mid-function either.

There might be more scope for using pext and/or pdep for more of the rest of the function.


I was thinking about how to do even better with BMI2. I think we could get multiple aei selectors from four shorts packed into 64b, then use pdep to deposit them in the low bits of different bytes. Then movq that to a vector register, where you use it as a shuffle control-mask for pshufb to do multiple 4bit LUT lookups.

So we could go from 60 BCD bits to 50 DPD bits at a time. (Use shrd to shift bits between registers to handle loads/stores to byte-addressable memory.)

Actually, 48 BCD bits (4 groups of 12bits each) -> 40 DPD bits is probably a lot easier, because you can unpack that to 4 groups of 16bits in a 64b integer register, using pdep. Dealing with the selectors for 5 groups is fine; you can unpack with pmovzx, but dealing with the rest of the data would require bit-shuffling in vector registers. Not even the slow AVX2 variable-shift insns would make that easy to do. (Although it might be interesting to consider how to implement this without BMI2 at all, for big speedups on CPUs with just SSSE3 (i.e. every relevant CPU) or maybe SSE4.1.)

This also means we can put two clusters of 4 groups into the low and high halves of a 128b register, to get even more parallelism.

As a bonus, 48bits is a whole number of bytes, so reading from a buffer of BCD digits wouldn't require any shrd insns to get the leftover 4 bits from the last 64b into the low 4 for the next. Or two offset pext masks to handle the case where the 4 ignored bits are the low or high 4 of the 64b. Anyway, I think doing 5 groups at once just isn't worth considering.

Full BMI2 / AVX pshufb LUT version (vectorizable)

The data movement could be:

ignored | group 3        | group 2        | group 1        |  group 0
16bits  | abcd efgh iklm | abcd efgh iklm | abcd efgh iklm | abcd efgh iklm

         3   2   1 | 0
pext -> aei|aei|aei|aei  # packed together in the low bits

          2  |      1            |        0
pdep ->  ... |0000 0000 0000 0aei|0000 0000 0000 0aei  # each in a separate 16b word

movq -> xmm vector register.
 (Then pinsrq another group of 4 selectors into the upper 64b of the vector reg).  So the vector part can handle 2 (or AVX2: 4) of this at once

vpshufb xmm2 -> map each byte to another byte (IMUL table)
vpshufb xmm3 -> map each byte to another byte (OR table)
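
A hedged sketch of that selector path in code (the mask constants and register choices are mine, derived from the layout above; untested):

        # %rdi = 48 packed BCD bits, four 12-bit groups, group 0 in the low bits
        movabs $0x0000888888888888,%rdx
        pext   %rdx,%rdi,%rax            # low 12 bits of %rax = aei|aei|aei|aei
        movabs $0x0007000700070007,%rdx
        pdep   %rdx,%rax,%rax            # each aei now in bits 2..0 of its own 16b word
        movq   %rax,%xmm0                # ready to use as a vpshufb control mask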


Get the bits other than `aei` from each group of 3 BCD digits unpacked from 48b to 64b, into separate 16b words:

                  group 3       | group 2             | group 1             |  group 0
pdep(src)-> 0000 abcd efgh iklm | 0000 abcd efgh iklm | 0000 abcd efgh iklm | 0000 abcd efgh iklm

 movq this into a vector reg (xmm1).  (And repeat for the next 48b and pinsrq that to the upper64)

VPAND  xmm1, mask  (to zero aei in each group)

 Then use the vector-LUT results:
VPMULLW xmm1, xmm2 -> packed 16b multiply, keeping only the low16 of the result

VPAND   xmm1,  mask
VPXOR   xmm1, something
VPOR    xmm1, xmm3

movq / pextrq back to integer regs

pext to pack the bits back together
  You don't need the AND 0x3ff or equivalent:
  Those bits go away when you pext to pack each 16b down to 10b

shrd or something to pack the 40b results of this into 64b chunks for store to memory.
  Or: 32b store, then shift and store the last 8b, but that seems lame
  Or: just do 64b stores, overlapping with the previous.  So you write 24b of garbage every time.  Take care at the very end of the buffer.
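
As a concrete starting point for the first couple of steps in that outline (the pdep unpack and the VPAND), something like the following could work; the mask constant and the clear_aei label are mine, and it's untested:

        .section .rodata
        .p2align 4
clear_aei:                               # hypothetical: zeroes a, e, i in each 16b group
        .quad 0x0777077707770777
        .quad 0x0777077707770777

        .text
        # %rdi = 48 packed BCD bits (four 12-bit groups)
        movabs $0x0fff0fff0fff0fff,%rdx
        pdep   %rdx,%rdi,%rax            # one group per 16b word: 0000 abcd efgh iklm
        movq   %rax,%xmm1
        vpand  clear_aei(%rip),%xmm1,%xmm1   # zero a, e, i in each group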

Use AVX 3-operand versions of the 128b SSE instructions to avoid needing movdqa to not overwrite the table for pshufb. As long as you never run a 256b AVX instruction, you don't need to mess with vzeroupper. You might as well use the v (VEX) versions of all vector instructions, though, if you use any. Inside a VM, you might be running on a virtual CPU with BMI2 but not AVX support, so it's prob. still a good idea to check both CPU feature flags, rather than assuming AVX if you see BMI2 (even though that's safe for all physical hardware that currently exists).


This is starting to look really efficient. It might be worth doing the mul/xor/and stuff in vector regs, even if you don't have BMI2 pext/pdep to do the bit packing/unpacking. I guess you could use code like the existing non-BMI scalar routine to get selectors, and mask/shift/or could build up the non-selector data into 16b chunks. Or maybe shrd for shifting data from one reg into another?

Peter Cordes
  • Wow. Give me some time to grok this. You are awesome. – fuz Dec 07 '15 at 16:19
  • @FUZxxl: how is your input data packed? One BCD digit per nibble, densely packed? Does the BCD data go in groups of some number of digits, or is there ever only one contiguous number? (If there are groupings, and you don't want to make a group-of-three DPD out of BCD digits from two different groups...) The bit packing/unpacking gets slightly simpler if you only do 4 groups at once (48 BCD bits), instead of 5. (Since then you can use pdep to go straight to the selector for each group in a separate 16b element of a 64b integer register, instead of an unpack after getting into vector regs.) – Peter Cordes Dec 07 '15 at 17:18
  • My actual use case for this is that I have a file with 100,000,000,000 decimal digits of pi in packed BCD and want to convert them to DPD to conserve space. The output would be 4 declets per 5 bytes, doing 4 groups at once is fine. However, if this was about getting done I would be finished already, right now it's more about finding the fastest way to do that for me, just for the fun and educational value. – fuz Dec 07 '15 at 17:21
  • @FUZxxl: ok perfect. So a big vectorized routine optimized for many BCD groups is exactly what you need. Things would be a lot different if you only ever had like 6 or 20 BCD digits together. I just finished an edit with a viable way to vectorize the whole thing, with PDEP to unpack the non-selector bits from 12 to 16b. – Peter Cordes Dec 07 '15 at 17:55