Other answers propose various ways to add together values sitting in a single variable (without unpacking them). While these approaches give pretty good throughput (with POPCNT in particular), they have large latency, either because of long computation chains or because of high-latency instructions.
It may be better to use normal addition instructions (adding one pair of values at a time), use simple operations like masks and shifts to split these values from one another, and use instruction-level parallelism to do this efficiently. Also, the position of the two middle values in the byte suggests a variant of table lookup that uses a single 64-bit register instead of memory. All this allows the sum of the four to be computed in only 4 or 5 clocks.
The original table-lookup approach suggested in the OP may consist of the following steps:
1. load byte with four values from memory (5 clocks)
2. calculate sum of the values using lookup table (5 clocks)
3. update pointer (1 clock)
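In C this baseline is roughly the following (my own sketch; `table` is assumed to be a 256-entry table holding, for each byte value, the sum of its four 2-bit fields):

/* table[b] is assumed to hold the sum of the four 2-bit fields of byte b */
static const unsigned char *next_block_lookup(const unsigned char *p,
                                              const unsigned char table[256])
{
    return p + 5 + table[*p];  /* 1 length byte + 4 minimal data bytes + extra bytes */
}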
64-bit register lookup
The following snippet shows how to perform step #2 in 5 clocks and also how to combine steps #2 and #3, keeping latency at 5 clocks (which could be optimized to 4 clocks with a complex addressing mode for the memory load):
p += 5 + (*p & 3) + (*p >> 6) +
((0x6543543243213210ull >> (*p & 0x3C)) & 0xF);
Here constant "5" means that we skip current byte with lengths as well as 4 data bytes corresponding to all-zero lengths. This snippet corresponds to the following code (64-bit only):
mov ecx, 3Ch
and ecx, ebx ;clock 1
mov eax, 3
and eax, ebx ;clock 1
shr ebx, 6 ;clock 1
add ebx, eax ;clock 2
mov rax, 6543543243213210h
shr rax, cl ;clock 2..3
and eax, 0Fh ;clock 4
add rsi, 5
add rsi, rbx ;clock 3 or 4
movzx ebx, byte ptr [rsi + rax] ;clock 5..9
add rsi, rax
I tried to get this code produced automatically by the following compilers: gcc 4.6.3, clang 3.0, icc 12.1.0. The first two didn't do anything good, but Intel's compiler did the job almost perfectly.
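For reference, here is a small self-contained check of the expression from this section (the test harness itself is my own sketch, not part of the measured code):

#include <assert.h>

int main(void)
{
    for (unsigned b = 0; b < 256; b++) {
        unsigned naive = (b & 3) + ((b >> 2) & 3) + ((b >> 4) & 3) + (b >> 6);
        unsigned fast  = (b & 3) + (b >> 6) +
                         ((0x6543543243213210ull >> (b & 0x3C)) & 0xF);
        assert(fast == naive);  /* the nibble table covers the two middle fields */
    }
    return 0;
}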
Fast bitfield extraction with ROR instruction
Edit: Nathan's tests reveal a problem with the following approach. The ROR instruction on Sandy Bridge uses two ports and conflicts with the SHR instruction, so this code needs 1 more clock on Sandy Bridge, which makes it not very useful. It would probably work as expected on Ivy Bridge and Haswell.
It is not necessary to use the trick with a 64-bit register as a lookup table. Instead you could just rotate the byte by 4 bits, which places the two middle values into the positions of the first and the fourth values. Then you can process them the same way. This approach has at least one disadvantage: it is not so easy to express byte rotation in C. I'm also not quite sure about this rotation, because on older processors it may result in a partial register stall. The optimization manual hints that on Sandy Bridge we can update part of a register without the stall if the operation source is the same as the destination, but I'm not sure I understood it properly, and I have no proper hardware to check this. Anyway, here is the code (it may now be either 32-bit or 64-bit):
mov ecx, 3
and ecx, ebx ;clock 1
shr ebx, 6 ;clock 1
add ebx, ecx ;clock 2
ror al, 4 ;clock 1
mov ecx, 3
and ecx, eax ;clock 2
shr eax, 6 ;clock 2
add eax, ecx ;clock 3
add esi, 5
add esi, ebx ;clock 3
movzx ebx, byte ptr [esi+eax] ;clocks 4 .. 8
movzx eax, byte ptr [esi+eax] ;clocks 4 .. 8
add esi, eax
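For reference, the rotation itself can be written in plain C like this (a sketch; whether a compiler turns it into a single ROR is not guaranteed, which is the "not so easy to express" part):

/* Advance over one data block; p points at the byte with four 2-bit lengths. */
static const unsigned char *next_block_ror(const unsigned char *p)
{
    unsigned b = *p;
    unsigned r = ((b >> 4) | (b << 4)) & 0xFF;  /* ROR by 4 within the byte */
    return p + 5 + (b & 3) + (b >> 6)           /* fields 1 and 4 */
               + (r & 3) + (r >> 6);            /* fields 3 and 2 after rotation */
}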
Using boundary between AL and AH to unpack bitfields
This method differs from the previous one only in the way the two middle bitfields are extracted. Instead of ROR, which is expensive on Sandy Bridge, a simple shift is used. This shift positions the second bitfield in register AL and the third bitfield in AH. Then they are extracted with shifts/masks. As in the previous method, there are possibilities for a partial register stall here, now in two instructions instead of one. But it's very likely that Sandy Bridge and newer processors can execute them without delay.
mov ecx, 3
and ecx, ebx ;clock 1
shr ebx, 6 ;clock 1
add ebx, ecx ;clock 2
shl eax, 4 ;clock 1
mov edx, 3
and dl, ah ;clock 2
shr al, 6 ;clock 2
add dl, al ;clock 3
add esi, 5
add esi, ebx ;clock 3
movzx ebx, byte ptr [esi+edx] ;clock 4..8
movzx eax, byte ptr [esi+edx] ;clock 4..8
add esi, edx
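The same extraction written in C terms (a sketch; the low byte of `t` plays the role of AL and `t >> 8` plays the role of AH):

static const unsigned char *next_block_shl(const unsigned char *p)
{
    unsigned b = *p;
    unsigned t = b << 4;                /* SHL eax, 4 */
    return p + 5 + (b & 3) + (b >> 6)   /* fields 1 and 4 */
               + ((t >> 8) & 3)         /* field 3, taken from "AH" */
               + ((t >> 6) & 3);        /* field 2, taken from "AL" */
}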
Load and compute sum in parallel
It is also not necessary to load the byte with the 4 lengths and then calculate the sum sequentially; you could do these operations in parallel. There are only 13 possible values for the sum of the four fields, and if your data is compressible you'll rarely see this sum greater than 7. This means that instead of loading a single byte, you could load the 8 most likely next bytes into a 64-bit register, and you could do it before computing the sum of the four. The 8 values are loaded while the sum is calculated; then you just extract the proper value from this register with a shift and a mask. This idea may be used together with any means of computing the sum. Here it is used with a simple table lookup:
typedef unsigned long long ull;

ull four_lengths = *p;
for (...)
{
    ull preload = *((ull*)(p + 5));   /* load 8 candidate bytes in advance */
    unsigned sum = table[four_lengths];
    p += 5 + sum;
    if (sum > 7)
        four_lengths = *p;                           /* rare case: reload from memory */
    else
        four_lengths = (preload >> (sum*8)) & 0xFF;  /* pick the byte out of the preload */
}
With proper assembly code this adds only 2 clocks to the latency (shift and mask), which gives 7 clocks, but only on compressible data.
If you change the table lookup to computations, you may get a loop latency of only 6 clocks: 4 to add the values together and update the pointer, and 2 for shift and mask. Interestingly, in this case the loop latency is determined only by the computations and does not depend on the latency of the memory load.
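A sketch of that variant, combining the preload with the computation from the "64-bit register lookup" section (my own combination of the two snippets above; `step` and `lengths` are hypothetical names):

#include <string.h>

typedef unsigned long long ull;

/* One iteration of the loop above, with the table lookup replaced by computation.
   *lengths holds the current byte with four 2-bit fields and is updated in place. */
static const unsigned char *step(const unsigned char *p, ull *lengths)
{
    ull preload;
    memcpy(&preload, p + 5, 8);                 /* the unaligned 8-byte preload */
    unsigned b = (unsigned)*lengths;
    unsigned sum = (b & 3) + (b >> 6) +
                   ((0x6543543243213210ull >> (b & 0x3C)) & 0xF);
    p += 5 + sum;
    *lengths = (sum > 7) ? *p : ((preload >> (sum * 8)) & 0xFF);
    return p;
}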
Load and compute sum in parallel (deterministic approach)
Performing the load and the summation in parallel may also be done in a deterministic way. Loading two 64-bit registers and then selecting one of them with CMP+CMOV is one possibility, but it does not improve performance over sequential computation. The other possibility is to use 128-bit registers and AVX. Migrating data between 128-bit registers and GPR/memory adds a significant amount of latency (though half of this latency may be removed if we process two data blocks per iteration). We'll also need to use byte-aligned memory loads into the AVX registers (which also adds to the loop latency).
The idea is to perform all computations in AVX, except for the memory load, which should be done from a GPR. (There is an alternative: do everything in AVX and use broadcast+add+gather on Haswell, but it's unlikely to be faster.) It should also help to alternate the data loads between a pair of AVX registers (to process two data blocks per iteration). This allows pairs of load operations to partially overlap and cancels out half of the additional latency.
Start with unpacking the proper byte from the register:
vpshufb xmm0, xmm6, xmm0 ; clock 1
Add together four bitfields:
vpand xmm1, xmm0, [mask_12] ; clock 2 -- bitfields 1,2 ready
vpand xmm2, xmm0, [mask_34] ; clock 2 -- bitfields 3,4 (shifted)
vpsrlq xmm2, xmm2, 4 ; clock 3 -- bitfields 3,4 ready
vpshufb xmm1, xmm5, xmm1 ; clock 3 -- sum of bitfields 1 and 2
vpshufb xmm2, xmm5, xmm2 ; clock 4 -- sum of bitfields 3 and 4
vpaddb xmm0, xmm1, xmm2 ; clock 5 -- sum of all bitfields
Then update the address and load the next vector of bytes:
vpaddd xmm4, xmm4, [min_size]
vpaddd xmm4, xmm4, xmm1 ; clock 4 -- address + 5 + bitfields 1,2
vmovd esi, xmm4 ; clock 5..6
vmovd edx, xmm2 ; clock 5..6
vmovdqu xmm6, [esi + edx] ; clock 7..12
Then repeat the same code once more, only using xmm7 instead of xmm6. While xmm6 is loaded we may process xmm7.
This code uses several constants:
min_size = 5, 0, 0, ...
mask_12 = 0x0F, 0, 0, ...
mask_34 = 0xF0, 0, 0, ...
xmm5 = lookup table to add together two 2-bit values
A loop implemented as described here needs 12 clocks to complete and 'jumps' two data blocks at once, which means 6 clocks per data block. This number may be too optimistic: I'm not sure that MOVD needs only 2 clocks, and it's not clear what the latency of a MOVDQU instruction performing an unaligned memory load is. I suspect that MOVDQU has very high latency when the data crosses a cache-line boundary, which probably means something like 1 additional clock of latency on average. So about 7 clocks per data block is a more realistic estimate.
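For reference, the bitfield-summation part of this code (clocks 2..5 above) corresponds roughly to the following intrinsics (a sketch assuming SSSE3; the constant and function names are mine):

#include <tmmintrin.h>          /* SSSE3: _mm_shuffle_epi8 */

/* Sum the four 2-bit fields of every byte in v (analogue of clocks 2..5 above). */
static __m128i sum_four_fields(__m128i v)
{
    /* lookup table: sum of the two 2-bit fields packed in a nibble */
    const __m128i nibble_sums = _mm_setr_epi8(0,1,2,3, 1,2,3,4, 2,3,4,5, 3,4,5,6);
    const __m128i low_mask    = _mm_set1_epi8(0x0F);

    __m128i lo = _mm_and_si128(v, low_mask);                     /* fields 1,2 */
    __m128i hi = _mm_and_si128(_mm_srli_epi64(v, 4), low_mask);  /* fields 3,4 */
    __m128i s1 = _mm_shuffle_epi8(nibble_sums, lo);              /* f1 + f2 */
    __m128i s2 = _mm_shuffle_epi8(nibble_sums, hi);              /* f3 + f4 */
    return _mm_add_epi8(s1, s2);                                 /* f1+f2+f3+f4 */
}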
Using brute force
Jumping just one or two data blocks per iteration is convenient but does not fully use the resources of modern processors. After some pre-processing we may jump straight to the first data block in the next aligned 16 bytes of data. The pre-processing should read the data, compute the sum of the four fields for each byte, use this sum to compute "links" to the next four-field bytes, and finally follow these "links" up to the next aligned 16-byte block. All these computations are independent and may be computed in any order using the SSE/AVX instruction set. AVX2 would do the pre-processing two times faster.
1. Load a 16- or 32-byte data block with MOVDQA.
2. Add together the 4 bitfields of each byte. To do this, extract the high and low 4-bit nibbles with two PAND instructions, shift the high nibble with PSRL*, find the sum of each nibble with two PSHUFB, and add the two sums with PADDB. (6 uops)
3. Use PADDB to compute "links" to the next four-field bytes: add the constants 0x75, 0x76, ... to the bytes of the XMM/YMM register. (1 uop)
4. Follow the "links" with PSHUFB and PMAXUB (a more expensive alternative to PMAXUB is the combination of PCMPGTB and PBLENDVB). `VPSHUFB ymm1, ymm2, ymm2` does almost all the work: it substitutes "out-of-bound" values with zero. Then `VPMAXUB ymm2, ymm1, ymm2` restores the original "links" in place of these zeros. Two iterations are enough. After each iteration the distance covered by each "link" doubles, so we need only log(longest_chain_length) iterations. For example, the longest chain 0->5->10->15->X will be compressed to 0->10->X after one step and to 0->X after two steps. (4 uops; see the sketch after this list.)
5. Subtract 16 from each byte with PSUBB, and (for AVX2 only) extract the high 128 bits into a separate XMM register with VEXTRACTI128. (2 uops)
6. Now the pre-processing is finished and we may follow the "links" to the first data block in the next 16-byte piece of data. This might be done with PCMPGTB, PSHUFB, PSUBB, and PBLENDVB. But if we assign the range `0x70 .. 0x80` to the possible "link" values, a single PSHUFB will do the work properly (actually a pair of PSHUFB, in the case of AVX2). Values `0x70 .. 0x7F` select the proper byte from the next 16-byte register, while the value `0x80` skips the next 16 bytes and loads byte 0, which is exactly what is needed. (2 uops, latency = 2 clocks)
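Here is the doubling trick from step 4 written with intrinsics for a single 16-byte block (a sketch based on my reading of the 0x70-based "link" encoding above, not the author's exact code):

#include <tmmintrin.h>           /* SSSE3: _mm_shuffle_epi8 */

/* One doubling step of link-following. Assumes links[i] = 0x70 + position of the
   next length byte, so in-block links are 0x70..0x7F and links with bit 7 set
   point outside the current block. */
static __m128i follow_links_once(__m128i links)
{
    /* PSHUFB gathers links[links[i] & 15]; lanes whose index byte has bit 7 set
       (out-of-block links) are zeroed instead. */
    __m128i two_hops = _mm_shuffle_epi8(links, links);
    /* PMAXUB keeps the original link where the gather returned 0 and the
       (always larger) two-hop link where it succeeded. */
    return _mm_max_epu8(links, two_hops);
}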
Instructions for these 6 steps do not need to be ordered sequentially. For example, instructions for steps 5 and 2 may stand next to each other. Instructions for each step should process 16/32-byte blocks at different stages of the pipeline, like this: step 1 processes block i, step 2 processes block i-1, steps 3 and 4 process block i-2, etc.
The whole loop's latency might be 2 clocks (per 32 bytes of data), but the limiting factor here is throughput, not latency. When AVX2 is used we need to execute 15 uops, which means 5 clocks. If the data is not compressible and data blocks are large, this gives about 3 clocks per data block. If the data is compressible and data blocks are small, this gives about 1 clock per data block. (But since MOVDQA latency is 6 clocks, getting 5 clocks per 32 bytes requires two overlapping loads and processing twice as much data in each iteration.)
The pre-processing steps are independent of step #6, so they may be performed in a different thread. This may decrease the time per 32 bytes of data below 5 clocks.