
I have four 2-bit bitfields stored in a single byte. Each bitfield can thus represent 0, 1, 2, or 3. For example, here are the 4 possible values where the first 3 bitfields are zero:

00 00 00 00 = 0 0 0 0
00 00 00 01 = 0 0 0 1
00 00 00 10 = 0 0 0 2
00 00 00 11 = 0 0 0 3

I'd like an efficient way to sum the four bitfields. For example:

11 10 01 00 = 3 + 2 + 1 + 0 = 6

An 8-bit lookup table on a modern Intel x64 CPU takes 4 cycles to return an answer from L1. It seems like there should be some way to calculate the answer faster than that. 3 cycles gives room for about 6-12 simple bit operations. As a starter, the straightforward mask and shift looks like it will take 5 cycles on Sandy Bridge:

Assuming the bit fields are: d c b a, and that mask is: 00 00 00 11

Clarifying with help from Ira: This presumes that a, b, c, and d are identical and have all been set to the initial byte. Oddly, I think I can do this for free. Since I can do 2 loads per cycle, instead of loading byte once, I can just load it four times: a and d on the first cycle, b and c on the second. The second two loads will be delayed one cycle, but I don't need them until the second cycle. The splits below represent how things should break into separate cycles.

a = *byte
d = *byte

b = *byte
c = *byte

latency

latency

a &= mask
d >>= 6

b >>= 2
c >>= 4
a += d

b &= mask
c &= mask

b += c

a += b
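
In plain C (with p pointing at the byte), the same mask-and-shift computation is just:

uint8_t byte = *p;
unsigned a = byte & 0x3;          /* bits 0-1 */
unsigned b = (byte >> 2) & 0x3;   /* bits 2-3 */
unsigned c = (byte >> 4) & 0x3;   /* bits 4-5 */
unsigned d = byte >> 6;           /* bits 6-7 */
unsigned sum = a + b + c + d;     /* 0 .. 12  */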

A different encoding for the bitfields to make the logic easier would actually be fine, so long as it fits within a single byte and maps one-to-one with this scheme somehow. Dropping to assembly is also fine. Current target is Sandy Bridge, but targeting Haswell or beyond is fine too.

Application and motivation: I'm trying to make an open-source variable-bit decompression routine go faster. Each bitfield represents the compressed length of each of the four integers that follow. I need the sum to know how many bytes I need to jump to get to the next group of four. The current loop takes 10 cycles, with 5 of that coming from the lookup I'm trying to avoid. Shaving off a cycle would be a ~10% improvement.

Edit:

Originally I said "8 cycles", but as Evgeny points out below I was wrong: the loop takes 10. The only time there is a 4-cycle load is when loading from the first 2K of address space without using an index register. A correct list of latencies can be found in the Intel Architecture Optimization Manual, Section 2.12:

>    Data Type       (Base + Offset) > 2048   (Base + Offset) < 2048 
>                     Base + Index [+ Offset]
>     Integer                5 cycles               4 cycles
>     MMX, SSE, 128-bit AVX  6 cycles               5 cycles
>     X87                    7 cycles               6 cycles 
>     256-bit AVX            7 cycles               7 cycles

Edit:

I think this is how Ira's solution below would break into cycles. I think it also takes 5 cycles of work post load.

a = *byte
b = *byte

latency

latency 

latency

a &= 0x33
b >>= 2

b &= 0x33
c = a

a += b
c += b

a &= 7
c >>= 4

a += c 
phuclv
Nathan Kurz
  • Your example is odd: it assumes that the original data byte has somehow been magically replicated into a/b/c/d, and there has to be a cost associated with that. – Ira Baxter Jul 26 '13 at 12:01
  • Yes, it is odd and you are correct to point that out. It probably makes the example much harder to understand than I hoped. But I think I can in fact do it magically and without cost. Explanation added to text. – Nathan Kurz Jul 26 '13 at 19:44
  • How is it possible that "The current loop takes 8 cycles"? It should be 10 cycles: either "4 cycles load + 5 cycles table lookup + 1 cycle pointer update" or "5 cycles load with base+index addressing mode + 5 cycles table lookup" (it is possible to make table lookup only 4 cycles but then this table should start inside the first 2Kb of address space, which is unlikely). – Evgeny Kluev Jul 27 '13 at 13:01
  • I assume you know that the byte is non-zero. You can do that with a test before you process it. Arguably you can test *after* you've computed the sum, but the point is you have to test to leave the loop. So, maybe the problem is finding the sum, knowing it is nonzero. – Ira Baxter Jul 28 '13 at 23:18
  • Evegeny --- You are absolutely right, and my understanding of load times was just wrong. Among other things, I had read Intel's terribly imprecise statement "a base register plus an offset that is smaller than 2048" as Offset < 2048, rather than (Base + Offset) < 2048. I'll fix the text and add a link to right latencies. The current loop is indeed 10 cycles, which means a 4 cycle solution would be a winner rather than a tie. – Nathan Kurz Jul 29 '13 at 07:14
  • Ira -- The sum is allowed to be zero. In the actual use case, there are 4 possible lengths for each compressed integer: 1 byte, 2 bytes, 3 bytes, or 4 bytes; with 00 representing 1 byte and 11 representing 3 bytes. The actual length is thus sum + 4. I'll do the read as a fixed offset from the sum: 0x4(Current, Sum, 1). [Evgeny --- sorry for misspelling your name in the last comment. Just missed the edit window.] – Nathan Kurz Jul 29 '13 at 07:22
  • That's right, it should be "Offset<2048" because "(Base + Offset) < 2048" is practically useless. And I think I made a mistake speaking about 10 cycles. There should be one more, 11 total. That's because you read a byte, then after some processing you use it as a pointer. Which means there should be an additional operation like `movzx` somewhere to avoid partial register stall. Shaving off a cycle would only be 9%. Calculating sum in 4 cycles instead of 5 seems possible. But I think you could achieve more optimizing the whole loop, not just sum of the four. – Evgeny Kluev Jul 29 '13 at 08:51
  • So the byte count is the bit field value plus 1? "00 representing 1 byte and 11 representing 3 bytes" seem wrong then. I would have expected 11 to represent 4 bytes if I understood you. I think Evgeny is right: why don't you publish the entire loop? Counting cycles on a modern CPU is pretty hard. Have you tried modifying the loop with the alternative solutions to see the actual effect? – Ira Baxter Jul 29 '13 at 14:41
  • Evgeny -- I think you were right the first time. I haven't put proper cycle counting into the actual routine, but 10 was the number that Intel's IACA calculated, and the number per second that I was measuring also agreed with 10. The difference is probably that I can do the load with "offset(input, sum)" and update input in the same cycle. – Nathan Kurz Jul 29 '13 at 19:20
  • Ira --- Another stupid late at night error on my part. 11 is indeed 4 bytes, just as you would expect. Counting cycles is indeed hard, but I should just sit down and implement it correctly. I'm mostly using Intel's IACA (very useful tool if you're not familiar with it) to count cycles based on static analysis, and then checking against runtime for the entire decoder. Yes, I've implemented a bunch of these -- and just before I collapsed asleep last night I think I succeeded in getting a popcnt() approach down to 9 cycles for the loop. More to come. – Nathan Kurz Jul 29 '13 at 19:26
  • "A 8-bit lookup table on a modern Intel x64 CPU takes 4 cycles to return an answer from L1." L1 latency is 4 cycles on what?? Your own cycle-by-cycle analysis shows two loads completing with 1-cycle latency. "It seems like there should be some way to calculate the answer faster than that." Perhaps by using SSE registers as 16-entry lookup tables, but four cycles ain't a lot. – Potatoswatter Jul 31 '13 at 01:14
  • Potatoswatter - Per the discussion with Evgeny, it's actually 5 cycles to load an integer from L1 on Sandy Bridge. You might be confusing "latency" with "throughput". You can do two loads in a cycle (throughput), but the answer won't be ready for 5 cycles (latency). Whether 4 cycles is a lot depends on your application, but it's enough for 12 simple operations if you choose the right combination. – Nathan Kurz Jul 31 '13 at 09:46
  • Old discussion, but your "12 simple operations" is also simplifying away throughput vs. latency. It can't be a single dependency chain of 12 operations, where each one reads the output of the previous; that would have 12 cycle latency. (And even if they do have ideal ILP (instruction level parallelism), it would have 12/3 cycle reciprocal throughput on Sandybridge with its 3 ALU ports, much worse than loads which can run at 0.5c throughput, i.e. 2 loads per clock.) – Peter Cordes Apr 01 '21 at 16:54
  • Note that the 4-cycle load-use latency optimistic shortcut attempt only happens when the base register is the result of a another load (e.g. pointer chasing in a linked list or tree, when it's most valuable). [Is there a penalty when base+offset is in a different page than the base?](https://stackoverflow.com/q/52351397) shows that there's a possible downside. – Peter Cordes Apr 01 '21 at 17:33

6 Answers

6

Would a builtin POPCOUNT instruction help?

n  = POPCOUNT(byte & 0x55);
n += 2 * POPCOUNT(byte & 0xAA);

Or maybe

  word = byte + ((byte&0xAA) << 8);
  n = POPCOUNT(word);

Not sure about the total timing. This discussion says popcount has 3 cycles latency, 1 throughput.
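
In C with the compiler builtin, the first variant is just (a sketch; __builtin_popcount compiles to a single POPCNT with -msse4.2/-mpopcnt):

#include <stdint.h>

static inline unsigned sum_fields_popcount(uint8_t byte) {
    return __builtin_popcount(byte & 0x55)        /* ones place of each 2-bit field */
         + 2 * __builtin_popcount(byte & 0xAA);   /* twos place, counted double     */
}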


UPDATE:
I may be missing some important fact about how to run IACA, but after a few experiments in the 12-11 throughput range, I compiled the following:

uint32_t decodeFast(uint8_t *in, size_t count) {
  uint64_t key1 = *in;
  uint64_t key2;
  size_t adv;
  while (count--) {
    IACA_START;
    key2 = key1 & 0xAA;
    in  += __builtin_popcount(key1);
    adv  = __builtin_popcount(key2);
    in  += adv + 4;
    key1 = *in;
  }
  IACA_END;
  return key1;
}

with gcc -std=c99 -msse4 -m64 -O3 test.c

and got 3.55 cycles !?!:

Block Throughput: 3.55 Cycles       Throughput Bottleneck: InterIteration
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
---------------------------------------------------------------------
|   1    |           | 1.0 |           |           |     |     |    | popcnt edx,eax
|   1    | 0.9       |     |           |           |     | 0.1 | CP | and eax,0x55 
|   1    |           | 1.0 |           |           |     |     | CP | popcnt eax,eax
|   1    | 0.8       |     |           |           |     | 0.2 |    | movsxd rdx,edx
|   1    | 0.6       |     |           |           |     | 0.4 |    | add rdi, rdx
|   1    | 0.1       | 0.1 |           |           |     | 0.9 | CP | cdqe 
|   1    | 0.2       | 0.3 |           |           |     | 0.6 |    | sub rsi, 1
|   1    | 0.2       | 0.8 |           |           |     |     | CP | lea rdi,[rdi+rax+4] 
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     | CP | movzx eax,[rdi]
|   1    |           |     |           |           |     | 1.0 |    | jnz 0xffff

Two more ideas

A possible Micro-optimization to do the sum in 2 instructions

total=0;
PDEP(vals, 0x03030303, *in);  # expand the four 2-bit fields into separate bytes
PSADBW(total, vals)           # total := sum of abs(0 - byte) for each byte in vals

Latency of each is supposedly 3, so this may not help. Maybe the byte-wise summation could be replaced with simple shifts and adds along the lines of AX = total + (total >> 16); ADD AL, AH
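
In intrinsics this idea might look like the sketch below (PDEP is BMI2, so Haswell or later only; the function name is mine):

#include <immintrin.h>   /* _pdep_u32 (BMI2), _mm_sad_epu8 (SSE2) */
#include <stdint.h>

static inline unsigned sum_fields_pdep_psadbw(uint8_t key) {
    uint32_t vals = _pdep_u32(key, 0x03030303u);           /* one 2-bit field per byte    */
    __m128i v     = _mm_cvtsi32_si128((int)vals);          /* bytes: a, b, c, d, 0, ...   */
    __m128i sums  = _mm_sad_epu8(v, _mm_setzero_si128());  /* horizontal sum of the bytes */
    return (unsigned)_mm_cvtsi128_si32(sums);
}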

A macro-optimization:
You mention using the key as a lookup into a table of shuffle instructions. Why not just store the distance to the next key along with the shuffle instruction? Either store a bigger table, or possibly squeeze the 4 bit length into unused bits 3-6 of the shuffle key, at the expense of needing a mask to extract it.

AShelly
  • Clever. A variant of your second one: POPCOUNT(byte)+POPCOUNT(byte&AA)? At least the two POPCOUNTS can be executed somewhat in parallel. If POPCount is 3 cycle latency, though, he's looking at 4-5 cycles so this doesn't quite work either. – Ira Baxter Jul 28 '13 at 23:07
  • Popcount would be perfect for larger bit sets (say, 10 groups of 3 bits, 16 groups of 4 bits or if you insist 32 groups of 2 bits). Cycle time doesn't improve but you amortize the cost over much bigger groups so you get an apparant speedup that should be in multiples rather than percents. – Ira Baxter Jul 30 '13 at 00:14
  • First two pieces of good news: I'm able to reproduce your result with IACA, and I'm even able to get it down to a claimed 3.15 cycles by using only 64-bit registers so I can skip the `cdqe` and `movsxd`. Now the bad news: as you probably expect, this is almost certainly a bug with IACA. Since the load takes 5, 3.x doesn't seem possible. And if I switch the order of the LEA and MOVZX at the end (adjusting the load address) I get the expected 9 cycles. I think your usage is correct, and fear the tool is buggy. Intel is quite responsive on their help forum if you want to report it. – Nathan Kurz Jul 31 '13 at 05:37
  • It may be a bug in IACA, but I can consistently get variations with reported throughput in the low 3's and latency around 8. Which got me thinking that throughput of a single pass may not be the best metric, and this may be too artificial a test case. You state that the latency of decoding doesn't matter, but at the very least, the keys need to be passed off to the decoder, right? Shouldn't you include that mechanism in your test case? It seems odd to micro-optimize a loop that needs more instructions added. And working with more code gives more opportunity to shuffle things... – AShelly Aug 01 '13 at 12:56
  • Apart from bugs, the throughput for IACA is based on a steady state assuming a loop around the code. Knowing that the latency for the load is 5, and that the rest of the loop is dependent on the load, I'm certain the reported throughput is a bug. Yes, it's possible I should have started out with the complete loop. I hoped the more general question of summing bitfields would be more approachable, but am not sure that was the right choice. I've been adding back the decoding when I test, and so far it hasn't affected the cycle count. – Nathan Kurz Aug 01 '13 at 19:11
  • I agree it's an IACA bug. And I think you made the right choice with the initial question. I was just thinking that for the optimization challenge in your wiki answer, a more complete example might yield more opportunities for speedup. – AShelly Aug 01 '13 at 21:00
  • I'm new to StackOverflow, and still trying to figure out the community dynamics. Yes, I'll try to post the complete routine once it feels like we've reached an optimum for this part of the problem. There certainly may be better ways to combine the two than I have. Thanks! – Nathan Kurz Aug 01 '13 at 23:14
  • Just saw your edits. PDEP might be a good approach, but not available for Sandy Bridge. It's part of BMI2 which came with Haswell. I don't think there is a good way to embed the advance length with the shuffle operand, but it's good to consider. The operand is a 16-byte XMM vector, which is the most I can load for SSE. Embedding it into the operand might be possible, but would add at least to cycle to remove before use. Extra loads aren't much problem as long as we are willing to presume they are from L1, which the tables will be. – Nathan Kurz Aug 06 '13 at 10:41
  • This probably fell afoul of SnB-family's false (output) dependency for `popcnt` - [Replacing a 32-bit loop counter with 64-bit introduces crazy performance deviations with \_mm\_popcnt\_u64 on Intel CPUs](https://stackoverflow.com/a/25089720) / [Why does breaking the "output dependency" of LZCNT matter?](https://stackoverflow.com/a/47234390). That microarchitectural pitfall was only discovered after this answer was posted, so compilers weren't working around it: looks like this asm has a 4-cycle loop-carried dependency through EDX (popcnt+movsxd). – Peter Cordes Apr 01 '21 at 17:10
  • (The wasted `movsxd` and `cdqe` are from widening the int return values of `__builtin_popcount` to the width of uint64_t, instead of casting it to unsigned so it can zero-extend for free.) – Peter Cordes Apr 01 '21 at 17:11
  • If you're summing 2-bit counters across multiple contiguous bytes, you want to work in `uint64_t` chunks and pick out all 32 ones-place and twos-place bits, not load + mask each byte separately. i.e. this should go 8x faster with `typedef uint64_t aliasing_u64 __attribute__((may_alias,aligned(1)));`. (Or with memcpy instead to do an aliasing and alignment-safe unaligned load.) – Peter Cordes Apr 01 '21 at 17:17
  • (oh, but that's *not* what this is doing; I just read the question and code more carefully. Maybe other future readers would want that.) – Peter Cordes Apr 01 '21 at 17:50
5

Other answers propose various means to add together values sitting in a single variable (without unpacking them). While these approaches give pretty good throughput (with POPCNT in particular), they have large latency, either because of long computation chains or because they use high-latency instructions.

It may be better to use normal addition instructions (adding together one pair of values at a time), use simple operations like masks and shifts to split these values from one another, and use instruction-level parallelism to do this efficiently. Also, the position of the two middle values in the byte suggests a variant of table lookup that uses a single 64-bit register instead of memory. All this allows the sum of the four to be calculated in only 4 or 5 clocks.

The original table lookup approach suggested in the OP consists of the following steps:

  1. load byte with four values from memory (5 clocks)
  2. calculate sum of the values using lookup table (5 clocks)
  3. update pointer (1 clock)

64-bit register lookup

The following snippet shows how to perform step #2 in 5 clocks and also combine steps #2 and #3, keeping latency still at 5 clocks (which could be optimized to 4 clocks with a complex addressing mode for the memory load):

p += 5 + (*p & 3) + (*p >> 6) +
  ((0x6543543243213210ull >> (*p & 0x3C)) & 0xF);
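
The 64-bit constant is a 16-entry nibble table: nibble i holds (i & 3) + (i >> 2), the sum of the two middle bitfields for index i, and (*p & 0x3C) is already 4 times that index, so it can be used directly as the shift count. A quick way to regenerate the constant (a sketch, not part of the timed code):

#include <stdint.h>

static uint64_t build_middle_sum_table(void) {
    uint64_t table = 0;
    for (unsigned i = 0; i < 16; i++)
        table |= (uint64_t)((i & 3) + (i >> 2)) << (4 * i);
    return table;   /* == 0x6543543243213210 */
}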

Here constant "5" means that we skip current byte with lengths as well as 4 data bytes corresponding to all-zero lengths. This snippet corresponds to the following code (64-bit only):

mov eax, 3Ch
and eax, ebx              ;clock 1
mov ecx, 3
and ecx, ebx              ;clock 1
shr ebx, 6                ;clock 1
add ebx, ecx              ;clock 2
mov rcx, 6543543243213210h
shr rcx, eax              ;clock 2..3
and ecx, 0Fh              ;clock 4
add rsi, 5
add rsi, rbx              ;clock 3 or 4
movzx ebx, [rsi + rcx]    ;clock 5..9
add rsi, rcx

I tried to produce this code automatically with the following compilers: gcc 4.6.3, clang 3.0, icc 12.1.0. The first two didn't do anything good, but Intel's compiler did the job almost perfectly.


Fast bitfield extraction with ROR instruction

Edit: Nathan's tests reveal a problem with the following approach. The ROR instruction on Sandy Bridge uses two ports and conflicts with the SHR instruction, so this code needs 1 more clock on Sandy Bridge, which makes it not very useful. It would probably work as expected on Ivy Bridge and Haswell.

It is not necessary to use the trick with a 64-bit register as a lookup table. Instead you could just rotate the byte by 4 bits, which places the two middle values in the positions of the first and fourth values. Then you could process them the same way. This approach has at least one disadvantage: it is not so easy to express byte rotation in C. Also, I'm not quite sure about this rotation, because on older processors it may result in a partial register stall. The optimization manual hints that for Sandy Bridge we can update part of a register without the stall if the operation source is the same as the destination, but I'm not sure I understood it properly, and I have no proper hardware to check it. Anyway, here is the code (now it may be either 32-bit or 64-bit):

mov ecx, 3
and ecx, ebx              ;clock 1
shr ebx, 6                ;clock 1
add ebx, ecx              ;clock 2
ror al, 4                 ;clock 1
mov ecx, 3
and ecx, eax              ;clock 2
shr eax, 6                ;clock 2
add ecx, eax              ;clock 3
add esi, 5
add esi, ebx              ;clock 3
movzx ebx, [esi+ecx]      ;clocks 4 .. 8
movzx eax, [esi+ecx]      ;clocks 4 .. 8
add esi, ecx
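
For reference, the usual way to write the 4-bit byte rotation in C, which a compiler may or may not turn into a single ROR/ROL (a sketch only):

#include <stdint.h>

static inline uint8_t rotate4(uint8_t x) {
    return (uint8_t)((x >> 4) | (x << 4));
}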

Using boundary between AL and AH to unpack bitfields

This method differs from the previous one only in the way the two middle bitfields are extracted. Instead of ROR, which is expensive on Sandy Bridge, a simple shift is used. This shift positions the second bitfield in register AL and the third bitfield in AH. Then they are extracted with shifts/masks. As in the previous method, there are possibilities for a partial register stall here, now in two instructions instead of one. But it's very likely Sandy Bridge and newer processors can execute them without delay.

mov ecx, 3
and ecx, ebx              ;clock 1
shr ebx, 6                ;clock 1
add ebx, ecx              ;clock 2
shl eax, 4                ;clock 1
mov edx, 3
and dl, ah                ;clock 2
shr al, 6                 ;clock 2
add dl, al                ;clock 3
add esi, 5
add esi, ebx              ;clock 3
movzx ebx, [esi+edx]      ;clock 4..8
movzx eax, [esi+edx]      ;clock 4..8
add esi, edx

Load and compute sum in parallel

Also, it is not necessary to load the byte with the four lengths and calculate the sum sequentially: you could do these operations in parallel. There are only 13 possible values for the sum of the four. If your data is compressible, you'll rarely see this sum greater than 7. Which means that instead of loading a single byte, you could load the 8 most likely bytes into a 64-bit register. And you could do it earlier than computing the sum of the four: the 8 values are loaded while the sum is calculated. Then you just get the proper value from this register with a shift and mask. This idea may be used together with any means of computing the sum. Here it is used with a simple table lookup:

typedef unsigned long long ull;
ull four_lengths = *p;
for (...)
{
  ull preload = *((ull*)(p + 5));
  unsigned sum = table[four_lengths];
  p += 5 + sum;

  if (sum > 7)
    four_lengths = *p;
  else
    four_lengths = (preload >> (sum*8)) & 255;
}

With proper assembly code this adds only 2 clocks to latency: shift and mask. Which gives 7 clocks (but only on compressible data).

If you change the table lookup to computations, you may get a loop latency of only 6 clocks: 4 to add the values together and update the pointer, and 2 for the shift and mask. Interestingly, in this case the loop latency is determined only by computations and does not depend on the latency of the memory load.


Load and compute sum in parallel (deterministic approach)

Performing the load and summation in parallel may also be done in a deterministic way. Loading two 64-bit registers and then selecting one of them with CMP+CMOV is one possibility, but it does not improve performance over sequential computation. The other possibility is using 128-bit registers and AVX. Migrating data between 128-bit registers and GPR/memory adds a significant amount of latency (but half of this latency may be removed if we process two data blocks per iteration). Also we'll need to use byte-aligned memory loads to AVX registers (which also adds to loop latency).

The idea is to perform all computations in AVX except for memory load which should be done from GPR. (There is an alternative to do everything in AVX and use broadcast+add+gather on Haswell, but it's unlikely to be faster). Also it should be helpful to alternate data load to a pair of AVX registers (to process two data blocks per iteration). This allows pairs of load operations to overlap partially and cancels out half of additional latency.

Start with unpacking proper byte from register:

vpshufb xmm0, xmm6, xmm0      ; clock 1

Add together four bitfields:

vpand xmm1, xmm0, [mask_12]   ; clock 2 -- bitfields 1,2 ready
vpand xmm2, xmm0, [mask_34]   ; clock 2 -- bitfields 3,4 (shifted)
vpsrlq xmm2, xmm2, 4          ; clock 3 -- bitfields 3,4 ready
vpshufb xmm1, xmm5, xmm1      ; clock 3 -- sum of bitfields 1 and 2
vpshufb xmm2, xmm5, xmm2      ; clock 4 -- sum of bitfields 3 and 4
vpaddb xmm0, xmm1, xmm2       ; clock 5 -- sum of all bitfields

Then update address and load next vector of bytes:

vpaddd xmm4, xmm4, [min_size]
vpaddd xmm4, xmm4, xmm1       ; clock 4 -- address + 5 + bitfields 1,2
vmovd esi, xmm4               ; clock 5..6
vmovd edx, xmm2               ; clock 5..6
vmovdqu xmm6, [esi + edx]     ; clock 7..12

Then repeat the same code once more, only using xmm7 instead of xmm6. While xmm6 is loaded we may process xmm7.

This code uses several constants:

min_size = 5, 0, 0, ...
mask_12 = 0x0F, 0, 0, ...
mask_34 = 0xF0, 0, 0, ...
xmm5 = lookup table to add together two 2-bit values

A loop implemented as described here needs 12 clocks to complete and 'jumps' two data blocks at once, which means 6 cycles per data block. This number may be too optimistic. I'm not quite sure that MOVD needs only 2 clocks. Also it's not clear what the latency of the MOVDQU instruction is when performing an unaligned memory load. I suspect that MOVDQU has very high latency when the data crosses a cache-line boundary. I suppose this means something like 1 additional clock of latency on average. So about 7 cycles per data block is a more realistic estimate.


Using brute force

Jumping just one or two data blocks per iteration is convenient but does not fully use the resources of modern processors. After some pre-processing we may implement jumping straight to the first data block in the next aligned 16 bytes of data. Pre-processing should read the data, compute the sum of the four fields for each byte, use this sum to compute "links" to the next four-field bytes, and finally follow these "links" up to the next aligned 16-byte block. All these computations are independent and may be computed in any order using the SSE/AVX instruction set. AVX2 would do the pre-processing two times faster.

  1. Load 16 or 32 byte data block with MOVDQA.
  2. Add together the 4 bitfields of each byte. To do this, extract the high and low 4-bit nibbles with two PAND instructions, shift the high nibble with PSRL*, find the sum of each nibble with two PSHUFB, and add the two sums with PADDB (a rough intrinsics sketch follows this list). (6 uops)
  3. Use PADDB to compute "links" to the next four-field bytes: add constants 0x75, 0x76, ... to the bytes of XMM/YMM register. (1 uop)
  4. Follow the "links" with PSHUFB and PMAXUB (more expensive alternative to PMAXUB is combination of PCMPGTB and PBLENDVB). VPSHUFB ymm1, ymm2, ymm2 does almost all the work. It substitutes "out-of-bound" values with zero. Then VPMAXUB ymm2, ymm1, ymm2 restores original "links" in place of these zeros. Two iterations is enough. After each iteration distance for each "link" is twice as large, so we need only log(longest_chain_length) iterations. For example, the most lengthy chain 0->5->10->15->X will be compressed to 0->10->X after one step and to 0->X after two steps. (4 uops)
  5. Subtract 16 from each byte with PSUBB, and (for AVX2 only) extract high 128 bits to a separate XMM register with VEXTRACTI128. (2 uops)
  6. Now pre-processing is finished. We may follow the "links" to first data block in the next 16-byte piece of data. This might be done with PCMPGTB, PSHUFB, PSUBB, and PBLENDVB. But if we assign range 0x70 .. 0x80 for possible "link" values, a single PSHUFB will do the work properly (actually a pair of PSHUFB, in case of AVX2). Values 0x70 .. 0x7F select proper byte from the next 16-byte register while value 0x80 would skip next 16 bytes and load byte 0, which is exactly what is needed. (2 uops, latency = 2 clocks)
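
A rough SSSE3 sketch of step 2 (the names are mine; the same pattern widens directly to AVX2):

#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */

/* Per-byte sum of the four 2-bit fields: split each byte into nibbles, look
   each nibble up in a 16-entry table of (low field + high field), then add. */
static inline __m128i sum_fields_per_byte(__m128i v) {
    const __m128i low_mask   = _mm_set1_epi8(0x0F);
    const __m128i nibble_sum = _mm_setr_epi8(0,1,2,3, 1,2,3,4, 2,3,4,5, 3,4,5,6);
    __m128i lo = _mm_and_si128(v, low_mask);
    __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), low_mask);
    return _mm_add_epi8(_mm_shuffle_epi8(nibble_sum, lo),
                        _mm_shuffle_epi8(nibble_sum, hi));
}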

Instructions for these 6 steps do not need to be ordered sequentially. For example, instructions for steps 5 and 2 may stand next to each other. Instructions for each step should process 16/32-byte blocks for different stages of pipeline, like this: step 1 processes block i, step 2 processes block i-1, steps 3,4 process block i-2, etc.

The whole loop's latency might be 2 clocks (per 32 bytes of data). But limiting factor here is throughput, not latency. When AVX2 is used we need to execute 15 uops, which means 5 clocks. If data is not compressible and data blocks are large, this gives about 3 clocks per data block. If data is compressible and data blocks are small, this gives about 1 clock per data block. (But since MOVDQA latency is 6 clocks, to get 5 clocks per 32 bytes we need two overlapping loads and to process twice as much data in each loop).

Pre-processing steps are independent of step #6. So they may be performed in different threads. This may decrease time per 32 bytes of data below 5 clocks.

Peter Cordes
Evgeny Kluev
  • I love the 64 bit lookup table :-} – Ira Baxter Jul 29 '13 at 19:54
  • Yes, the 64-bit lookup table is wonderful, and I hadn't considered it. Assembly only is no problem --- it's so hard to get a compiler to produce what I want that I'll probably be using a lot of inline assembly anyway. Regarding compressible: it will _not_ be the case that sums less than 7 are more likely. The compressed numbers are likely to already be "delta compressed", and represent the differences between the original numbers. The average magnitude of this difference, and thus the average sum, is not known ahead of time. You have lots of ideas here: I'll start working through them. – Nathan Kurz Jul 29 '13 at 21:04
  • I think a solution along the lines of your 3rd suggestion will need to be branch-free. I've been playing with a very similar approach, but loading a 16-byte vector instead of an 8-byte subset. The question then becomes how to calculate the sum and extract a single "key" byte from a vector faster than two lookups (10 cycles). So far, I can match that 10 but not do better: lookup shuffle operand from a table with "key", PSHUFB, MOVD xmm->key. I'll submit it as answer so I can format it correctly. – Nathan Kurz Jul 29 '13 at 21:37
  • @NathanKurz: Branch-free will add CMP+CMOV to latency which means 3 more cycles, total 9 cycles. Equal to my 1st and 2nd suggestions. By the way, if your measurements are correct, this means optimization manual is wrong and base+index addressing on SB uses only 4 cycles instead of 5. Then my 1st and 2nd suggestions may be optimized to 8 cycles with updating the pointer and loading at the same time. And I think using AVX/SSE vectors is a dead end because corresponding instructions have large latency. – Evgeny Kluev Jul 30 '13 at 06:59
  • Evgeny - By branch-free I didn't mean CMOV, just that it needed to be a solution without unpredictable branches. Agree that CMOV is not helpful here. Not certain about the measurements, but certain that they get faster in the order indicated, and that IACA reports the numbers given. But why do you think this implies 4 cycles for loads? I just counted again, and think that 5 cycles fits for all of them once one accounts for speculative execution continuing the loop before the decrement is done. Also, which AVX/SSE latency are you thinking of? – Nathan Kurz Jul 30 '13 at 19:42
  • @NathanKurz: I tried other branch-free approaches and found that CMOV is better than alternatives, still not good enough. About 4 cycles: comparing your tests with latencies 11 and 10, I see that you removed "add" from the loop (-1cycle) but used base+index addressing instead (+1cycle?). It seems complex addressing didn't add to loop latency, load is still 4 cycles. So either optimization manual is wrong or your processor is not Sandy Bridge. As for AVX/SSE, every instruction has additional 1 clock latency when loading/storing from/to memory or GPR register. That's the problem. – Evgeny Kluev Jul 30 '13 at 20:19
  • From the manual I quoted in the original answer, I think [base] and [base + index] and [base + index + offset] all take the same time: 5 cycles for GPR/8B, 6 cycles for XMM/16B, 7 cycles for YMM/32B. On Sandy Bridge and Haswell, Intel manual says data from XMM/YMM to GPR takes 2 cycles, but only 1 cycle to load XMM/YMM from GPR (Table C-12a). Agner's docs say 1 in both direction, but I think he's wrong. It's possible this is too much, but I think we need all 16 bytes loaded. I'm pretty sure it's Sandy Bridge E5-1620. Not my machine, but shall I see if I can get you access to it? – Nathan Kurz Jul 30 '13 at 21:22
  • If all GPR loads were 5 cycles, then you should see latency 12 for your first testcase: 2*5 loads + 1 movzx + 1 add. In fact it is only 11. Which means one of the loads should be only 4. Anyway, I think this question is not so important to bother arranging access. – Evgeny Kluev Jul 30 '13 at 22:53
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/34497/discussion-between-nathan-kurz-and-evgeny-kluev) – Nathan Kurz Jul 30 '13 at 23:58
  • Evgeny -- I've added your 2nd solution to my answer, but haven't been able to get it down to 9 cycles yet. I think the problem is that ROR needs both P1 and P5 during the same cycle to have latency 1. Do you see a way to fix it? I haven't tried your first yet. – Nathan Kurz Jul 31 '13 at 04:18
  • @NathanKurz: I updated the answer and continued discussion in chat. – Evgeny Kluev Jul 31 '13 at 13:08
  • *Also it's not clear what is latency of MOVDQU instruction performing unaligned memory load.* It may not be that bad. Sandy Bridge is usually said to have 0 penalty for unaligned loads. Optimization 2.2.5.2 suggests this by not listing unaligned as a reason for delay. http://www.7-cpu.com/cpu/SandyBridge.html says 5 cycles extra for split cache line, 25 cycles extra for crossing a 4K page boundary. Nehalem similar (Table 2-25). Unaligned **stores** across page boundary still really bad: 150 cycles (11.6.3). Pretty sure MOVD is 1 cycle GPR->XMM, 2 cycles XMM->GPR. – Nathan Kurz Aug 04 '13 at 08:08
  • Oh weird, I thought `ROR` was single-uop on Sandybridge (because that's how Agner Fog lists ror r,i), but it turns out https://uops.info/ measures 2 uops even for the immediate-count form (which unlike the implicit-1 form doesn't have to modify OF while leaving other flags of the SPAZO group unmodified). But yeah, according to https://uops.info/, single-uop ROR was new with IvyBridge. (Only for the immediate-count form, not CL or by-1) – Peter Cordes Apr 01 '21 at 17:05
4

Consider

 temp = (byte & 0x33) + ((byte >> 2) & 0x33);
 sum  = (temp & 7) + (temp >> 4);

Should be 9 machine instructions, many of them executed in parallel. (OP's first try is 9 instructions plus some moves not mentioned).

On inspection, this seems to have too many serial dependencies to be a win.

EDIT: The discussion about binary ops being destructive, and LEA avoiding this, got me thinking about how to use LEA to combine more than one operand, and multiplication by constants. The above code tries to right-normalize the answer by shifting right, but we can left-normalize the answer by multiplying. With that insight, this code might work:

     mov     ebx,  byte      ; ~1: gotta start somewhere
     mov     ecx, ebx        ; ~2: = byte
     and     ebx, 0xCC       ; ~3: 2 sets of 2 bits, with zeroed holes
     and     ecx, 0x33       ; ~3: complementary pair of bits
     lea     edx, [ebx+4*ecx] ; ~4: sum bit pairs, forming 2 4-bit sums
     lea     edi, [8*edx+edx] ; ~5: need 16*(lower bits of edx)
     lea     edi, [8*edx+edi] ; ~6: edi = upper nibble + 16* lower nibble
     shr     edi, 6           ; ~7: right normalized
     and     edi, 0x0F        ; ~8: masked

Well, entertaining but still didn't pan out. 3 clocks isn't very long :-{
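
For reference, the same trick as a single C expression (the multiply by 17 is what the two LEAs build up):

unsigned t   = (byte & 0xCC) + 4 * (byte & 0x33);  /* (a+b)<<2 and (c+d)<<6        */
unsigned sum = ((t * 17) >> 6) & 0xF;              /* both pair-sums meet at bit 6 */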

Ira Baxter
  • Innovative, thanks! As you suspect, I think the dependencies slow it down. Regarding copies: at the assembly level (which this will compile to) 'and', 'add', and 'shift' are all destructive to one of their operands (equivalent to &=, +=, >>=). This means that if you need 'a' and 'b' to remain the same, 'sum = a + b' really gets implemented as 'c = a; c += b' or 'c = b; c += a', adding another cycle unless you have a spare copy by some other means. – Nathan Kurz Jul 26 '13 at 20:21
  • I'm pretty familiar with x86 assembly code. In general, binary operations are destructive to one of the input operands. However, the LEA instruction lets you add two operands (with some constant multiplies thrown in) and place the answer in a third register. Unfortunately, there are no opportunities to use that here. – Ira Baxter Jul 26 '13 at 20:44
  • I was even optimizing ADD to LEA in a different window for a different problem, and didn't even even think of applying that here! While I can't see a great way to make use of it, certainly worth exploring. – Nathan Kurz Jul 26 '13 at 21:39
  • Is there a reason you are limiting your bit patterns to 4 bit-pairs in one byte? Why not 8 pairs in 16 bits, or 5 bit-triples in 16 bits? The reason I ask is that you have to do a lot of work to process a very small amount of information, and that leaves you almost no time. If you have a bigger chunk of bits to process, you might have more time per set; you can use more complex schemes and amortize and still win. Consider that my first solution is really a binary adder tree in disguise. You can make a deeper tree with more bits in not many more cycles, and maybe that's a win. – Ira Baxter Jul 28 '13 at 23:24
  • This particular attempt is to optimize an existing data format that has 1 byte of widths followed by a payload of 4 compressed integers (4 to 16 bytes). The actual decode depends on the ability to fit the payload in a single vector. With 32-byte AVX vectors, 8 pairs in 16 bits would make sense, but it would require AVX2 (Haswell) to do the bit manipulation. On the other hand, although the extra bits are cheap to process, the 5-cycle table lookup stays constant. – Nathan Kurz Jul 29 '13 at 19:35
  • So, on a becoming-widespread implementation, same cycle count, more processed per cycle. By the time your version matters to any significant set of users, Haswell (and AMD's response) will be widespread. You should target that, not what people already have. Current machines will run well with your simple implementation; Haswell machines will run better :-} – Ira Baxter Jul 29 '13 at 19:51
  • Agree on targetting the future, so I have a strong preference for solutions that scale to wider vectors. Popcnt looks like it does. I also just like the challenge. But right now I'm helping a friend write an academic paper on this, and for comparison sake we need a non-AVX implementation. Link to his previous paper and source is here: http://lemire.me/blog/archives/2012/09/12/fast-integer-compression-decoding-billions-of-integers-per-second/ In the paper, this scheme is called "Varint-GB", and there's a better description there than I've managed to give. – Nathan Kurz Jul 29 '13 at 21:14
  • @NathanKurz: old question, but worth pointing out that you don't *need* AVX2 to work with wider data, just two XMM vectors. `vpshufb ymm` is two in-lane shuffles anyway, so it's not particularly more powerful than 2x `vpshufb xmm`, if that's what you're getting at. So whether you have AVX2, AVX1, or SSSE3, you probably want to sum the first four 2-bit fields in parallel with summing all 8 to get an offset for loading the 2nd vector of 16B (or with AVX2, for `vinserti128 ymm, mem, 1` high half), and index a 256-entry table of shuffle-control vectors twice to load. – Peter Cordes Apr 01 '21 at 17:59
3

I have no idea how many cycles it could take, and I might be completely off, but it is possible to compute the sum with 5 simple operations using 32-bit multiplications:

unsigned int sum = ((((byte * 0x10101) & 0xC30C3) * 0x41041) >> 18) & 0xF;

The first multiplication repeats the bit pattern

abcdefgh -> abcdefghabcdefghabcdefgh

The first bitand keeps a pair every 6 bits:

abcdefghabcdefghabcdefgh -> 0000ef0000cd0000ab0000gh

The second multiplication sums the bit patterns (only yyyy is of interest):

                     0000ef0000cd0000ab0000gh
             + 0000ef0000cd0000ab0000gh000000
       + 0000ef0000cd0000ab0000gh000000000000
 + 0000ef0000cd0000ab0000gh000000000000000000
 --------------------------------------------
   ..................00yyyy00................

The last 2 ops shift the yyyy to the right and cut the left part

The main problem is that ops are sequential though...

EDIT

Or just translate the whole thing 10 bits to the left and remove last bitand:

unsigned int sum = (((byte * 0x4040400) & 0x30C30C00) * 0x41041) >> 28;
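
Both variants are easy to check exhaustively over the 256 possible bytes (a quick standalone test):

#include <assert.h>
#include <stdio.h>

int main(void) {
    for (unsigned byte = 0; byte < 256; byte++) {
        unsigned expected = (byte & 3) + ((byte >> 2) & 3) + ((byte >> 4) & 3) + (byte >> 6);
        unsigned sum1 = ((((byte * 0x10101u) & 0xC30C3u) * 0x41041u) >> 18) & 0xF;
        unsigned sum2 = (((byte * 0x4040400u) & 0x30C30C00u) * 0x41041u) >> 28;
        assert(sum1 == expected && sum2 == expected);
    }
    puts("both multiply variants match for all 256 bytes");
    return 0;
}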
aka.nice
  • Beautiful and promising approach, but I'm not sure if we can make it fast. 'MUL' and 'IMUL' (unsigned and signed multiply) have a latency of 3 cycles. So if we have two serial multiplications, I think we are already at 6 cycles. I'm not sure yet what 'zzz' is suggesting, but it's possibly related to this. He's loading the high 8-bits and low 8-bits of the register separately, which might be another way of creating a similar arrangement. Or maybe not. – Nathan Kurz Jul 31 '13 at 01:17
2

There are lots of great ideas here, but it's getting hard to find them amidst the discussion. Let's use this answer to offer final solutions along with their timing. Please feel free to edit this post and add your own along with timing. If unsure about the timing paste in the code at the bottom and I'll measure it. x64 assembly would be best. I'll happily compile C, but rarely have good results at this level of optimization without lots of tweaking.

Overview

Rephrasing the question to put it into proper context: the goal is to rapidly decode an integer compression format known as "Varint-GB" (or Group Varint). Among other places, it's described in a paper by Daniel Lemire and Leo Boytsov. I made comments on the first version of this paper in the standard "clearly the author is an idiot" style, and Daniel (the main author of the paper, and not so much an idiot) cunningly roped me in to help code for a followup.

Standard Varint (aka VByte) has a flag at the beginning of each byte determining whether it is the end of the integer, but this is slow to parse. This version has a single 'key' byte and then 4 compressed integers of payload. The key consists of 4 2-bit fields, each representing the byte-length of one of the compressed integers that follow. Each can be 1 byte (00), 2 bytes (01), 3 bytes (10), or 4 bytes (11). Each 'chunk' is therefore 5 to 17 bytes long, but always encodes the same number (4) of 32-bit unsigned integers.

Sample Chunk:
  Key:  01 01 10 00  
  Data: [A: two bytes] [B: two bytes] [C: three bytes] [D: one byte]
Decodes to: 00 00 AA AA   00 00 BB BB   00 CC CC CC  00 00 00 DD

The key is an index into a table of 16-byte shuffle patterns, and the actual decoding is done by shuffling the data bytes into the correct spacing with PSHUFB.

vec_t Data = *input
vec_t ShuffleKey = decodeTable[key]     
VEC_SHUFFLE(Data, ShuffleKey) // PSHUFB
*out = Data
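
In intrinsics that pseudocode corresponds roughly to the sketch below (my assumptions: in points at the key byte, the payload starts at in + 1, and decodeTable[key] holds the 16-byte PSHUFB control; real code also has to avoid reading past the end of the buffer):

#include <stdint.h>
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */

static inline void decode_chunk(const uint8_t *in, uint32_t *out,
                                const __m128i decodeTable[256]) {
    uint8_t key  = in[0];
    __m128i data = _mm_loadu_si128((const __m128i *)(in + 1));       /* up to 16 payload bytes */
    __m128i shuf = decodeTable[key];
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(data, shuf));  /* writes 4 uint32s       */
}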

In reality, there is usually a "delta decoding" step as well, since the original integers generally have been made smaller by compressing the "delta" (difference) between the integers rather than the integers themselves. But the latency of the decode routine usually doesn't matter, since the next iteration does not depend on it.

The Problem Restated

The problem specified here is to hop from one 'key' to the next. Since there are no dependencies on the decoded Data here (only on the key), I'll ignore the actual decoding and just concentrate on the loop that reads the keys. The function takes a pointer to a key and a count n, and returns the n'th key.

11 cycles

The 'basic' approach is to use a lookup table of 'advance' offsets with the key as index. Look up any of the 256 keys in offsetTable to get the precalculated offset to advance (the summed byte lengths of the four integers, plus 1 for the key byte). Add that to the current input position, and read the next key. According to Intel's IACA, this loop takes 11 cycles on Sandy Bridge (all cycle counts below are on Sandy Bridge as well).

uint32_t decodeBasic(uint8_t *in, size_t count) {
    uint64_t key, advance;
    for (size_t i = count; i > 0; i--) {
        key = *in;
        advance = offsetTable[key];
        in += advance;
    }
    return key;
}

0000000000000000 <decodeBasic>:
   0:   test   %rsi,%rsi
   3:   je     19 <decodeBasic+0x19>
   5:   nopl   (%rax)
   8:   movzbl (%rdi),%eax
   b:   add    0x0(,%rax,8),%rdi
  13:   sub    $0x1,%rsi
  17:   jne    8 <decodeBasic+0x8>
  19:   repz retq 

Block Throughput: 11.00 Cycles       Throughput Bottleneck: InterIteration
   0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
--------------------------------------------------------------
|           |     | 1.0   1.0 |           |     |     | CP | movzx eax, byte ptr [rdi]
| 0.3       | 0.3 |           | 1.0   1.0 |     | 0.3 | CP | add rdi, qword ptr [rax*8]
|           |     |           |           |     | 1.0 |    | sub rsi, 0x1
|           |     |           |           |     |     |    | jnz 0xffffffffffffffe7

10 cycles

From there, we can get down to 10 cycles by rearranging the loop so that we update the input pointer and start loading the next key at the same time. You may note that I had to use inline assembly to 'encourage' the compiler to produce the output I wanted. I'll also start dropping the outer loop, as it (usually) stays the same.

key = *in;
advance = offsetTable[key]; 
for (size_t i = count; i > 0; i--) {
    key = *(in + advance);
    ASM_LEA_ADD_BASE(in, advance);
    advance = offsetTable[key];
}

Block Throughput: 10.00 Cycles       Throughput Bottleneck: InterIteration
|  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
------------------------------------------------------------
|           |     | 1.0   1.0 |           |     |     | CP | movzx eax, byte ptr [rdi+rdx*1]
| 0.5       | 0.5 |           |           |     |     |    | lea rdi, ptr [rdi+rdx*1]
|           |     |           | 1.0   1.0 |     |     | CP | mov rdx, qword ptr [rax*8]
|           |     |           |           |     | 1.0 |    | sub rsi, 0x1
|           |     |           |           |     |     |    | jnz 0xffffffffffffffe2

9 cycles

I'd tried to use POPCNT earlier, but without the suggestions, hints, and ideas from Ira and AShelly I wasn't having much luck. But putting the pieces together, I think I have something that runs the loop in 9 cycles. I've put it into the actual decoder, and the number of Ints/s seems to agree with this. This loop is essentially in assembly, since I couldn't get the compiler to do what I wanted otherwise, much less across multiple compilers.

[Edit: removed extra MOV per comment by AShelly]

uint64_t key1 = *in;
uint64_t key2 = *in;
for (size_t i = count; i > 0; i--) {
    uint64_t advance1, advance2;
    ASM_POPCOUNT(advance1, key1);
    ASM_AND(key2, 0xAA);

    ASM_POPCOUNT(advance2, key2);
    in += advance1;

    ASM_MOVE_BYTE(key1, *(in + advance2 + 4));
    ASM_LOAD_BASE_OFFSET_INDEX_MUL(key2, in, 4, advance2, 1);        
    in += advance2;
 }


Block Throughput: 9.00 Cycles       Throughput Bottleneck: InterIteration
|  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
------------------------------------------------------------
|           | 1.0 |           |           |     |     | CP | popcnt r8, rax
| 1.0       |     |           |           |     |     | CP | and rdx, 0xaa
|           |     |           |           |     | 1.0 | CP | add r8, rdi
|           | 1.0 |           |           |     |     | CP | popcnt rcx, rdx
|           |     | 1.0   1.0 |           |     |     | CP | movzx rax, byte ptr [rcx+r8*1+0x4]
|           |     |           | 1.0   1.0 |     |     | CP | mov rdx, qword ptr [r8+rcx*1+0x4]
| 1.0       |     |           |           |     |     |    | lea rdi, ptr [rcx+r8*1]
|           |     |           |           |     | 1.0 |    | dec rsi
|           |     |           |           |     |     |    | jnz 0xffffffffffffffd0

As an indication of the intricacy of the moving parts in modern processors, I had an interesting experience with a variation of this routine. If I combine the second mov rax line with the and rax, 0xaa by specifying a memory location with and (mov rax, 0xAA; and rax, qword ptr [r8+rcx*1+0x4]), I end up with a routine whose runtime vacillates by 30% from run to run. I think this is because sometimes the initial conditions leading up to the loop cause the 'and' micro-op of the load/and to run before the POPCNT for the entire loop.
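
The ASM_... macros aren't reproduced here, but GCC-style inline asm along these lines would generate similar code (a guess at their shape, not the exact macros used for the measurements above):

#define ASM_POPCOUNT(dst, src) \
    __asm__ ("popcnt %1, %0" : "=r" (dst) : "r" (src))

#define ASM_AND(reg, mask) \
    __asm__ ("and %1, %0" : "+r" (reg) : "ri" (mask))

#define ASM_LEA_ADD_BASE(base, index) \
    __asm__ ("lea (%0, %1, 1), %0" : "+r" (base) : "r" (index))

#define ASM_MOVE_BYTE(dst, src) \
    __asm__ ("movzbq %1, %0" : "=r" (dst) : "m" (src))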

8 cycles

Anyone?

Evgeny

This is my attempt to implement Evgeny's solution. I haven't been able to get it down to 9 cycles yet, at least for IACA's model of Sandy Bridge (which has been accurate so far). I think the issue is that while ROR has a latency of 1, it takes two micro-ops on P1 or P5. To get a latency of 1, both have to be available. The others are just one micro-op, and thus always a latency of 1. AND, ADD, and MOV can issue on P0, P1, or P5, but SHR cannot be on P1. I can get closer to 10 cycles by adding some extra junk ops which prevent ADD and AND from displacing SHR or ROR, but I'm not sure how to get below 10.

Block Throughput: 10.55 Cycles       Throughput Bottleneck: InterIteration
|  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
------------------------------------------------------------
|           |     | 1.0   1.0 |           |     |     | CP | movzx eax, byte ptr [esi+0x5]
|           |     |           | 1.0   1.0 |     |     | CP | movzx ebx, byte ptr [esi+0x5]
| 0.2       | 0.6 |           |           |     | 0.3 |    | add esi, 0x5
| 0.3       | 0.3 |           |           |     | 0.3 |    | mov ecx, 0x3
| 0.2       | 0.2 |           |           |     | 0.6 |    | mov edx, 0x3
| 1.4       |     |           |           |     | 0.6 | CP | ror al, 0x4
| 0.1       | 0.7 |           |           |     | 0.2 | CP | and ecx, ebx
| 0.6       |     |           |           |     | 0.4 | CP | shr ebx, 0x6
| 0.1       | 0.7 |           |           |     | 0.2 | CP | add ebx, ecx
| 0.3       | 0.4 |           |           |     | 0.3 | CP | and edx, eax
| 0.6       |     |           |           |     | 0.3 | CP | shr eax, 0x6
| 0.1       | 0.7 |           |           |     | 0.2 | CP | add eax, edx
| 0.3       | 0.3 |           |           |     | 0.3 | CP | add esi, ebx
| 0.2       | 0.2 |           |           |     | 0.6 | CP | add esi, eax
Nathan Kurz
  • What's the ASM_COPY(key2, 0xAA) for? It looks like key2 is overwritten without being used 3 lines later. – AShelly Jul 30 '13 at 03:52
  • Good catch. That was a leftover from the vacillating version I mentioned that combined the mov/and into `and reg, ptr [mem]`. It's not needed for this version. I'll remove. – Nathan Kurz Jul 30 '13 at 19:06
  • I was trying to experiment with this using IACA with gcc on Linux. do you mind sharing your `ASM_...` macros? I can't quite get the same output. – AShelly Jul 31 '13 at 01:49
  • AShelly - I should mention that although I've been able to get GCC and ICC to produce the assembly shown with the macros, I usually haven't been able to get IACA_START and IACA_END at exactly the correct locations without hand editing the "output.s". Add "-save-temps" to your compiler options to save this, then reassemble with "gcc -c output.s -o output.o", then run "iaca.sh" on that. But once you are doing this, it may be simpler to skip my ASM macros and just edit the "output.s" file directly. – Nathan Kurz Jul 31 '13 at 04:24
  • `offsetTable` can be a table of `uint8_t` that gets loaded with `movzx`. That doesn't make it faster, except for saving L1d cache footprint (and a tiny bit of code-size). Also, this disassembly of an un-linked object file is confusing because the `[rax*8]` addressing mode omits the `[offsetTable + rax*8]` part. With a placeholder `disp32=0` not yet filled in by the linker, that *is* how one would decode it (and `[rax*8]` does require a placeholder disp32=0), but you can use `objdump -drwC` to get symbol info, and `-Mintel` to get Intel-syntax. (But prob. not worth updating now :/) – Peter Cordes Apr 01 '21 at 18:28
0
  mov al,1
  mov ah,2
  mov bl,3
  mov bh,4
  add ax,bx
  add al,ah
zzz
  • I added an edit to ask for clarification, but it hasn't come through yet. I worked through your example, but am not sure what to do with the result. Could you clarify? Is it related to what aka.nice proposed? – Nathan Kurz Jul 31 '13 at 00:12