I'm writing a routine to convert between BCD (4 bits per decimal digit) and Densely Packed Decimal (DPD) (10 bits per 3 decimal digits). DPD is further documented (with the suggestion that software use lookup tables) on Mike Cowlishaw's web site.
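For readers unfamiliar with the encoding, here is a minimal C sketch of a straightforward BCD-to-DPD encoder that can serve as a reference when testing faster routines. It follows the standard DPD case table as I understand it and uses the bit names abcd efgh iklm from the assembly in this post. The names `bcd2dpd_ref` and `bcd2dpd_ref_selftest` are invented for this example and appear nowhere else.

```c
#include <assert.h>
#include <stdint.h>

/* Reference BCD -> DPD encoder (illustrative; the name is hypothetical).
 * Input:  three BCD digits as abcd efgh iklm in the low 12 bits.
 * Output: the 10-bit DPD declet, laid out as pq r st u v wx y. */
static uint16_t bcd2dpd_ref(uint16_t bcd)
{
    unsigned a = (bcd >> 11) & 1, bc = (bcd >> 9) & 3, d = (bcd >> 8) & 1;
    unsigned e = (bcd >> 7) & 1,  fg = (bcd >> 5) & 3, h = (bcd >> 4) & 1;
    unsigned i = (bcd >> 3) & 1,  kl = (bcd >> 1) & 3, m = bcd & 1;
    unsigned pq, st, wx;

    /* a, e, i say which digits are "large" (8 or 9). */
    switch ((a << 2) | (e << 1) | i) {
    case 0:  pq = bc; st = fg; wx = kl; break;  /* bcd fgh 0 klm */
    case 1:  pq = bc; st = fg; wx = 0;  break;  /* bcd fgh 1 00m */
    case 2:  pq = bc; st = kl; wx = 1;  break;  /* bcd klh 1 01m */
    case 3:  pq = bc; st = 2;  wx = 3;  break;  /* bcd 10h 1 11m */
    case 4:  pq = kl; st = fg; wx = 2;  break;  /* kld fgh 1 10m */
    case 5:  pq = fg; st = 1;  wx = 3;  break;  /* fgd 01h 1 11m */
    case 6:  pq = kl; st = 0;  wx = 3;  break;  /* kld 00h 1 11m */
    default: pq = 0;  st = 3;  wx = 3;  break;  /* 00d 11h 1 11m */
    }

    unsigned v = a | e | i;  /* indicator bit: set if any digit is large */
    return (uint16_t)((pq << 8) | (d << 7) | (st << 5) | (h << 4) |
                      (v << 3) | (wx << 1) | m);
}

/* A few spot checks of well-known encodings. */
static void bcd2dpd_ref_selftest(void)
{
    assert(bcd2dpd_ref(0x000) == 0x000);
    assert(bcd2dpd_ref(0x123) == 0x0a3);  /* all digits small */
    assert(bcd2dpd_ref(0x080) == 0x00a);
    assert(bcd2dpd_ref(0x800) == 0x00c);
    assert(bcd2dpd_ref(0x999) == 0x0ff);
}
```

Calling `bcd2dpd_ref_selftest()` from a test driver checks the spot values. An optimised routine can then be compared against `bcd2dpd_ref` for all 1000 valid inputs.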
This routine only ever needs the low 16 bits of the registers it uses, yet for shorter instruction encodings I have used 32-bit instructions wherever possible. Is there a speed penalty associated with code like:
mov data,%eax # high 16 bits of data are clear
...
shl %al
shr %eax
or
and $0x888,%edi # = 0000 a000 e000 i000
imul $0x0490,%di # = aei0 ei00 i000 0000
where the alternative to the 16-bit imul would be either a 32-bit imul followed by an and instruction, or a series of lea instructions followed by a final and instruction.
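To make the comparison concrete, the two multiply variants can be modelled in C. This is only a sketch of the arithmetic, not of the timing: `gather_aei_16` mimics the 16-bit imul, where truncation to 16 bits comes with the operand size, and `gather_aei_32` mimics a 32-bit imul followed by an and with 0xe000 (the mask is my assumption). All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Model of: and $0x888,%edi ; imul $0x0490,%di ; shr $13,%edi
 * The 16-bit operand size truncates the product to 16 bits. */
static unsigned gather_aei_16(unsigned x)
{
    uint16_t t = (uint16_t)(x & 0x888);   /* 0000 a000 e000 i000 */
    t = (uint16_t)(t * 0x0490u);          /* aei0 ei00 i000 0000 */
    return (unsigned)(t >> 13);           /* 0000 0000 0000 0aei */
}

/* Model of: and $0x888,%edi ; imul $0x0490,%edi ; and $0xe000,%edi ;
 * shr $13,%edi. The explicit mask removes product bits above bit 15. */
static unsigned gather_aei_32(unsigned x)
{
    uint32_t t = x & 0x888;
    t *= 0x0490u;
    t &= 0xe000u;
    return (unsigned)(t >> 13);
}

/* Both variants should yield the 3-bit value aei for every 12-bit input. */
static void gather_aei_check(void)
{
    for (unsigned x = 0; x < 0x1000; x++) {
        unsigned want = (((x >> 11) & 1) << 2) |
                        (((x >> 7) & 1) << 1) |
                        ((x >> 3) & 1);
        assert(gather_aei_16(x) == want);
        assert(gather_aei_32(x) == want);
    }
}
```

The partial products of a, e and i land on distinct bit positions, so no carries occur and the top three bits of the 16-bit result are exactly a, e and i.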
The complete code of my routine is below. Is there anything in it whose performance is worse than it could be because I mix word and dword instructions?
.section .text
.type bcd2dpd_mul,@function
.globl bcd2dpd_mul
# convert BCD to DPD with multiplication tricks
# input abcd efgh iklm in edi
.align 8
bcd2dpd_mul:
mov %edi,%eax # = 0000 abcd efgh iklm
shl %al # = 0000 abcd fghi klm0
shr %eax # = 0000 0abc dfgh iklm
test $0x880,%edi # fast path for a = e = 0
jz 1f
and $0x888,%edi # = 0000 a000 e000 i000
imul $0x0490,%di # = aei0 ei00 i000 0000
mov %eax,%esi
and $0x66,%esi # q = 0000 0000 0fg0 0kl0
shr $13,%edi # u = 0000 0000 0000 0aei
imul tab-8(,%rdi,4),%si # v = q * tab[u-2][0]
and $0x397,%eax # r = 0000 00bc d00h 0klm
xor %esi,%eax # w = r ^ v
or tab-6(,%rdi,4),%ax # x = w | tab[u-2][1]
and $0x3ff,%eax # = 0000 00xx xxxx xxxx
1: ret
.size bcd2dpd_mul,.-bcd2dpd_mul
.section .rodata
.align 4
tab:
.short 0x0011 ; .short 0x000a
.short 0x0000 ; .short 0x004e
.short 0x0081 ; .short 0x000c
.short 0x0008 ; .short 0x002e
.short 0x0081 ; .short 0x000e
.short 0x0000 ; .short 0x006e
.size tab,.-tab
Improved Code
After applying some suggestions from the answer and comments and some other trickery, here is my improved code.
.section .text
.type bcd2dpd_mul,@function
.globl bcd2dpd_mul
# convert BCD to DPD with multiplication tricks
# input abcd efgh iklm in edi
.align 8
bcd2dpd_mul:
mov %edi,%eax # = 0000 abcd efgh iklm
shl %al # = 0000 abcd fghi klm0
shr %eax # = 0000 0abc dfgh iklm
test $0x880,%edi # fast path for a = e = 0
jnz 1f
ret
.align 8
1: and $0x888,%edi # = 0000 a000 e000 i000
imul $0x49,%edi # = 0ae0 aei0 ei00 i000
mov %eax,%esi
and $0x66,%esi # q = 0000 0000 0fg0 0kl0
shr $8,%edi # = 0000 0000 0ae0 aei0
and $0xe,%edi # = 0000 0000 0000 aei0
movzwl lookup-4(%rdi),%edx
movzbl %dl,%edi
imul %edi,%esi # v = q * lookup[2*(u-2)]
and $0x397,%eax # r = 0000 00bc d00h 0klm
xor %esi,%eax # w = r ^ v
or %dh,%al # = w | lookup[2*(u-2)+1]
and $0x3ff,%eax # = 0000 00xx xxxx xxxx
ret
.size bcd2dpd_mul,.-bcd2dpd_mul
.section .rodata
.align 4
lookup:
.byte 0x11
.byte 0x0a
.byte 0x00
.byte 0x4e
.byte 0x81
.byte 0x0c
.byte 0x08
.byte 0x2e
.byte 0x81
.byte 0x0e
.byte 0x00
.byte 0x6e
.size lookup,.-lookup
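As a sanity check that switching to the byte-sized table did not change behaviour, both versions can be modelled in C and compared over all 12-bit inputs. This is a sketch that assumes the input fits in 12 bits. The names `squeeze`, `model_v1`, `model_v2` and `models_agree_check` are made up here, and the C mirrors only the arithmetic of the assembly, not its performance.

```c
#include <assert.h>
#include <stdint.h>

/* First version: 16-bit entries {multiplier, or-mask}, indexed by u-2. */
static const uint16_t tab[6][2] = {
    {0x0011, 0x000a}, {0x0000, 0x004e}, {0x0081, 0x000c},
    {0x0008, 0x002e}, {0x0081, 0x000e}, {0x0000, 0x006e},
};

/* Improved version: the same values packed into bytes. */
static const uint8_t lookup[12] = {
    0x11, 0x0a, 0x00, 0x4e, 0x81, 0x0c,
    0x08, 0x2e, 0x81, 0x0e, 0x00, 0x6e,
};

/* Shared prologue: shl %al ; shr %eax  ->  0000 0abc dfgh iklm */
static uint32_t squeeze(uint32_t x)
{
    uint32_t lo = (x << 1) & 0xff;        /* shl %al drops bit 7 (e) */
    return ((x & ~0xffu) | lo) >> 1;      /* shr %eax */
}

static uint32_t model_v1(uint32_t x)
{
    uint32_t eax = squeeze(x);
    if ((x & 0x880) == 0)                 /* fast path: a = e = 0 */
        return eax;
    uint32_t u = (uint16_t)((x & 0x888) * 0x0490u) >> 13;
    uint16_t si = (uint16_t)((eax & 0x66) * tab[u - 2][0]); /* 16-bit imul */
    eax = (eax & 0x397) ^ si;
    eax |= tab[u - 2][1];
    return eax & 0x3ff;
}

static uint32_t model_v2(uint32_t x)
{
    uint32_t eax = squeeze(x);
    if ((x & 0x880) == 0)                 /* fast path: a = e = 0 */
        return eax;
    uint32_t idx = (((x & 0x888) * 0x49u) >> 8) & 0xe;  /* aei0 = 2*u */
    uint32_t mul = lookup[idx - 4];       /* low byte of the load (%dl) */
    uint32_t orv = lookup[idx - 3];       /* high byte of the load (%dh) */
    uint32_t esi = (eax & 0x66) * mul;    /* 32-bit imul */
    eax = (eax & 0x397) ^ esi;
    eax |= orv;                           /* or %dh,%al */
    return eax & 0x3ff;
}

/* The two models should agree on every 12-bit input. */
static void models_agree_check(void)
{
    for (uint32_t x = 0; x < 0x1000; x++)
        assert(model_v1(x) == model_v2(x));
}
```

Running `models_agree_check()` exercises both the fast path and all six table entries. A spot value such as `model_v2(0x999)` should give the well-known DPD encoding 0x0ff for 999.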