Fast SIMD extraction of variable-sized fields with byte delimiter

Question

Suppose we want to extract fields from an input stream composed of variable-length fields. All we know is maximum width of every field and that each field ends with a byte with a value of 1. We want to extract the packed fields into a fixed format where each field has its maximum width (zero padded if the input field was less than the maximum).

Minimum width of each field is one byte.

For example, we are expecting to receive values for two fields. The maximum width of the first one is 3 bytes, maximum width of the second one is 2 bytes.

Suppose we've got an input vector {X, 1, 1} so we know the value of the first field is {X, 1} and value of the second field is {1}. So in this case the resulting vector should be equal to {0, X, 1, 0, 1}.

Or, we we've got an input vector {1, 1}, so the resulting vector should be equal to {0, 0, 1, 0, 1}.

I think I know a way of doing this with a lookup table. The problem is that we will end up with too big lookup table in case we decide to process more than 64 bits at once.

I'm not quite sure I understand what you're trying to accomplish, but it seems like something that could be done using a sequence of [permutation instructions](https://software.intel.com/en-us/blogs/2015/01/13/programming-using-avx2-permutations). — Cody Gray - on strike, Jul 21 '17 at 16:37
The problem is that shape of the input vector can be different. For the example above it can be one of {1, 1}, {X, 1, 1}, {X, X, 1, 1}, {1, X, 1}, etc. So in order to use permutation I need to generate a mask first, but I don't know how to do in efficiently. — Alexey R., Jul 21 '17 at 16:45
What I'm confused about is how you would ever know what the interpretation of bytes in the vector is. You'd need to pass that information to the function somehow; why not pass it as the required mask? — Cody Gray - on strike, Jul 21 '17 at 16:49
Every time we see '1' in the input vector we know this is the last byte of current field. So {1, 1} means that first '1' goes to least significant byte (LSB) of the first field and the second '1' goes to LSB of the second field. On the other hand, if we've got {X, 1, Y, 1} that means we should fill two bytes of the first field with {X, 1} and two bytes of the second field with {Y, 1}. So for every field of width N bytes we should expect a value in the input vector of width [1; N]. — Alexey R., Jul 21 '17 at 16:59
Will a maximum width field always have a 1 as it's final byte, or is there a rule that you just take a field to be it's maximum width when you don't find any 1? That is, in your example above, would the input {X, X, X, Y, Y} (where none of the Xs or Ys are 1) be decoded as {X, X, X} and {Y, Y} or would this be an error? — BeeOnRope, Jul 21 '17 at 18:12
Yes, the last byte for each field equals 1, so input {X, X, X, Y, Y} (where none of the Xs or Ys are 1) is an error. — Alexey R., Jul 21 '17 at 18:17
Does the algorithm need to _detect_ such an error, or can it simply give wrong output or crash in that case? — BeeOnRope, Jul 21 '17 at 18:21
The algorithm should work with minimum latency and doesn't need to detect any sort of errors. — Alexey R., Jul 21 '17 at 18:24

BeeOnRope · Answer 1 · 2017-07-21T18:14:14.500

3

One reasonable approach would be to use a vectorized cmp to find all the 1s, and then movmskb those results as a bitmap into a general purpose register, and then use that value to look up a pshufb mask that expands the bytes into the fields based on the bitmap.

This technique also handles the "maximum field with" restriction without cost since that behavior is built into the shuffle masks in the lookup table.

Now you aren't going to be able to take a full 32-byte ymm register will create a 32-bit bitmap and look that up directly, since it would need something like a 128 GB lookup table¹, which isn't feasible (or at least will be extremely slow). In practice you'd process some fixed number of output bytes that keeps your table size reasonable, something between 8 to 16 bytes, for example. The optimal value probably depends on how many times you do this operation a tight loop and the cost of cache pressure on the surrounding code.

Let's say you still want more speed. You could look at the actual distribution of field lengths, and if a few "typical" arrangements dominate, you could have an optimistic algorithm which takes the bitmap, hashes it down to a smaller number of bits, and then looks up that value in a first-level shuffle control table, which only has entries for the "expected" field lengths. In parallel you do another lookup to verify that the actual full bitmap matches the expected full bitmap associated with the first level table.

When you hit in the first level table, you proceed as above, with a large (16 or 32 byte) shuffle, otherwise you fall back to several smaller lookups as above. The hash needs to be something like a "perfect hash" for the expected values so that there are no collisions.

You can calculate the lookup tables at runtime, or embed then as constants in the binary itself.

¹ ... and even if you did create such a monster lookup table you'd run into the limitation that pshufb works in two 16-byte lanes, not across a whole 32-byte register.

edited Jul 21 '17 at 18:14

answered Jul 21 '17 at 17:54

BeeOnRope

60,350
16
207
386

Thank you. I was thinking about the same solution, except second LUT for most common patterns. I'll do some benchmarks and post the results. – Alexey R. Jul 21 '17 at 18:14
1

You can also try to _calculate_ the shuffle mask rather than looking it up, either in the vector registers or in general purpose registers (where you can use `pdep` which helps a lot). Something along the lines of what Peter shows in [this answer](https://stackoverflow.com/a/36951611/149138) - but since you are doing it byte-wise the calculation solution needs to scaled up (of course the LUT solution has the same problem) which makes it costlier. – BeeOnRope Jul 21 '17 at 18:18
@AlexeyR. and @ Bee: Probably faster to just work with 16 bytes at a time, using `pshufb`. This is a very similar problem to [parsing an IPv4 dotted-quad](https://stackoverflow.com/questions/31679341/fastest-way-to-get-ipv4-address-from-string) with pcmpeqb / pmovmsk / pshufb. Answers+comments on that question have some interesting ideas for compressing the LUT to trade-off extra instructions vs. size + cache misses. – Peter Cordes Jul 27 '17 at 08:00
@Peter Cordes - is that different than what I'm suggesting? I left the issue of whether you can even use the 32-byte width rather than a single lane to the side since as a first step you really need to see how you'll generate your shuffle masks. If you have to do a lookup in a big table for each lane and you don't have aren't processing it 32-bytes at a time after, maybe 16 bytes is OK. On other hand, trying to do it 32-bytes at a time helps relieve store and shuffle port pressure if you find a way to generate the mask efficiently. – BeeOnRope Jul 27 '17 at 08:19
The problem is the lack of lane-crossing byte shuffles. With variable-length groups, the upper lane might not be at the start of a group, so you need multiple shuffles to handle the lane-crossing instead of just doing overlapping 16B ops. As well as the problem of the LUT size. – Peter Cordes Jul 27 '17 at 08:45
Sure, the lack of lane-crossing shuffles is a problem but not always a fatal one. You don't necessarily need _either_ lane to align to the start of a group anyway, you just need all the bytes to that will end up in the output to be somewhere in the lane and then you need your shuffle mask to accommodate that. For some expansion factors and distributions you can better much "center" each lane on the expected range and have a very high change of getting all the bytes needed in each lane. Or you can just do the popcnt and look up in a 16-byte style and then go 32B from the shuffle onwards. – BeeOnRope Jul 27 '17 at 08:55
@PeterCordes - my main point is that I think I made this answer pretty general and not really implying that 32B-wide will be the way to do it. The description is pretty generic. – BeeOnRope Jul 27 '17 at 08:57
1

Yeah, agreed. I mostly just wanted to comment to link to that `pshufb` for IPv4 answer, which has some actual working code for doing something almost exactly like this: zero-padding out to 4B (and then some nice pmadd stuff for the decimal place-values). – Peter Cordes Jul 27 '17 at 09:02

Fast SIMD extraction of variable-sized fields with byte delimiter

1 Answers1