5

What would be the fastest way to pack two bytes into one? I have a large array of bytes. Each byte represents a number not larger than 15 (a 4-bit number). Because of that I could pack two bytes into one, placing the first byte into the higher nibble and the latter into the lower nibble.

My current approach is to create a second array half the size of the original, then iterate over the original array, shifting and ORing pairs of bytes to combine the nibbles. This works, however it takes a while depending on the size of the array. The arrays range from a few thousand entries to a few million. It's not catastrophic, but any optimization would be helpful.
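For reference, a minimal sketch of that approach (illustrative names; assumes the length is even and every source byte is <= 15):

#include <stddef.h>

/* src[2k] goes into the high nibble of dst[k], src[2k+1] into the low nibble. */
void pack_pairs(const unsigned char *src, unsigned char *dst, size_t n)
{
    for (size_t i = 0; i < n; i += 2)
        dst[i / 2] = (unsigned char)((src[i] << 4) | src[i + 1]);
}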

bames53
oipoistar
  • how big are the arrays? Because it _sounds_ like you're describing the only algorithm (except `|` instead of `&`) – Mooing Duck May 02 '14 at 19:10
  • Hopefully you realize that this will also double (or more) the time it takes to access a half-byte in the array. – zneak May 02 '14 at 19:11
  • @MooingDuck, fixed that. The arrays are from a few thousand entries to a few millions. It's not catastrophic but any optimization would be helpful. – oipoistar May 02 '14 at 19:14
  • 1
    On my machine, doing this with thirty million bytes takes less than 33ms. Is this really a performance problem you're encountering? http://coliru.stacked-crooked.com/a/a0eba0e907b5696e – Mooing Duck May 02 '14 at 19:33
  • 1
    One really efficient (0 cycles!) approach would be to do nothing; just keep the array unpacked. Is there some measurable advantage to packing the array? – Jeremy Friesner May 02 '14 at 19:47
  • 2
    @MooingDuck: you can trim about 10% by doing this: `unsigned lim = dest.size(); for(unsigned i=0, j=0; i < lim; ...)` – Edward May 02 '14 at 19:48
  • @Edward: That trims about 10% of the _theoretical_ operations, but I doubt it has any impact on an optimized build. Requires testing. – Mooing Duck May 02 '14 at 20:06
  • @MooingDuck: No, that's 10% as measured. Try it yourself: http://coliru.stacked-crooked.com/a/e0e63d2135c8f658 – Edward May 02 '14 at 20:09
  • @Edward: When switching to microseconds for more accurate measurements, and comparing only the fastest runs (31801us and 26086us) (http://coliru.stacked-crooked.com/a/7e61e6a75cccd8cf, http://coliru.stacked-crooked.com/a/0addb0dbce245766), yours is actually appears 18% faster! ...And seems entirely caused by the `dest.size()` in the condition: http://coliru.stacked-crooked.com/a/24a6af954939d14a (25897us) This one is ~1.16 destination bytes per _nanosecond_. – Mooing Duck May 02 '14 at 20:45
  • @MooingDuck: the C code I posted is another couple of percent faster. See http://coliru.stacked-crooked.com/a/caa2eed90de60c52 for the code with a complete test harness. – Edward May 02 '14 at 21:21
  • @Edward: Why C? : http://coliru.stacked-crooked.com/a/687e23e1b50c3448 – Mooing Duck May 02 '14 at 21:32
  • @MooingDuck: Two reasons: 1) I typically find it slightly more convenient to mix C and assembly vs. C++ and assembly and 2) on embedded platforms on which this kind of speed improvement might actually matter, it's more common to have C than C++ compilers. Given the choice, however, I usually prefer C++. – Edward May 02 '14 at 21:44

2 Answers

4

It will obviously take a while if your array is large - you need to go over all of it.

The first thing I'd do is create a lookup table from two bytes to one, so you don't need to shift and OR: take the next two bytes, use their combined value as an index into the table, and read off the resulting byte.

This lookup table only needs 2^12 entries (you only need 4 bits from the most significant byte), so it fits nicely in your CPU's L1 cache. It might be faster than shift-and-or.
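A rough sketch of that idea (illustrative names; assumes a little-endian CPU so the first byte of a pair is the low byte of the 16-bit index, and that every source byte is <= 15):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Since each byte is <= 15, bits 12-15 of the 16-bit index are always zero,
 * so 2^12 entries cover every possible pair. */
static uint8_t pack_lut[1 << 12];

static void init_pack_lut(void)
{
    for (unsigned first = 0; first < 16; ++first)         /* goes into the high nibble */
        for (unsigned second = 0; second < 16; ++second)  /* goes into the low nibble */
            pack_lut[first | (second << 8)] = (uint8_t)((first << 4) | second);
}

static void pack_with_lut(const uint8_t *src, uint8_t *dst, size_t n /* even */)
{
    for (size_t i = 0; i < n; i += 2) {
        uint16_t idx;
        memcpy(&idx, src + i, 2);   /* read a byte pair as one 16-bit index */
        dst[i / 2] = pack_lut[idx];
    }
}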

On the other hand, if you load 8 bytes at a time (on a 64-bit CPU, as they all are nowadays), you can turn them into 4 bytes and store those. You will also be able to parallelize this (divide the array into 4 parts, and have each core handle one part).

If there were an instruction that takes bytes 0, 2, 4 and 6 from a 64-bit register and puts them in a 32-bit register, you'd be done.
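Lacking such an instruction, here is a rough portable sketch of the 8-bytes-at-a-time idea using only shifts and masks (a hypothetical helper, assuming a little-endian CPU and that every source byte is <= 15):

#include <stdint.h>
#include <string.h>

/* Packs 8 unpacked source bytes into 4 packed bytes.
 * Source bytes 0,2,4,6 become the high nibbles, bytes 1,3,5,7 the low nibbles. */
static void pack8(const uint8_t *src, uint8_t *dst)
{
    uint64_t v;
    memcpy(&v, src, 8);                                   /* load 8 unpacked bytes */
    uint64_t hi = v & 0x000F000F000F000FULL;              /* src[0], src[2], src[4], src[6] */
    uint64_t lo = (v >> 8) & 0x000F000F000F000FULL;       /* src[1], src[3], src[5], src[7] */
    uint64_t packed16 = (hi << 4) | lo;                   /* one packed byte per 16-bit lane */
    /* squeeze the four packed bytes (at bits 0, 16, 32, 48) into one 32-bit word */
    uint32_t out = (uint32_t)( (packed16 & 0xFF)
                             | ((packed16 >> 8)  & 0xFF00)
                             | ((packed16 >> 16) & 0xFF0000)
                             | ((packed16 >> 24) & 0xFF000000) );
    memcpy(dst, &out, 4);                                 /* store 4 packed bytes */
}

Called in a loop over the array (with any remainder handled separately), this consumes eight source bytes per iteration and splits easily across cores.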

UPDATE: You mentioned in the question that you have a few million bytes. In that case, don't bother. The difference between highly optimized assembly and the naive implementation in C is not going to be worth the trouble. Just load the data two bytes at a time, shift and OR the two nibbles into one byte, and store it in the target array. Processing 1MB of data should be instant.

zmbq
  • 1
    loading a whole 2^12 bytes into RAM is probably slower than simply calculating them. – Mooing Duck May 02 '14 at 19:12
  • If you have a few gigabytes of data - no. If all you have is a few kilobytes, performance shouldn't be an issue at all. – zmbq May 02 '14 at 19:13
  • if you have gigabytes of data, then the data is likely to push the table out of the cache, giving _very_ slow access. – Mooing Duck May 02 '14 at 19:17
  • 1
    The L1 cache of an i7 is 32KB - the table will probably stay in the cache for the entire time. However, I'm sure there's a clever CPU command that can translate 8 nibbles into 4 bytes quickly. I just don't remember all the packing and unpacking opcodes. – zmbq May 02 '14 at 19:20
  • How did you get 2^12? Shouldn't a lookup to a byte have 2^8 => 256 entries? What's the lookup strategy, 2-d array? – Paul Rubel May 02 '14 at 19:28
  • 1
    When you load two nibbles into memory, you have bits 12-15 all zeroed. Bits 8-11 are the high nibble, bits 4-7 are zeroed out, and bits 0-3 are the low nibble. So, if you don't want to do any shifting at all because it's slow for some reason, you have 12 bits. – zmbq May 02 '14 at 19:30
0

I'd approach it first in C or C++, measure, and only resort to assembly if the performance is unacceptable. In C:

#include <assert.h>

/* Packs the array in place: the result ends up in the first len/2 bytes of buff. */
void packarray(unsigned char *buff, int len)
{
    unsigned char *packed;
    unsigned char byte;

    assert(len >= 2);         /* len must be at least 2 bytes */
    assert((len & 1) != 1);   /* len must be an even number */
    for (packed = buff; len > 0; len -= 2) {
        byte = *buff++;                      /* first byte becomes the high nibble */
        *packed++ = (byte << 4) | *buff++;   /* second byte becomes the low nibble */
    }
}

Warning: untested code
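As a hypothetical harness (not part of the answer) for the measure-first advice: fill a buffer with 4-bit values, pack it in place with the function above, and time it:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void packarray(unsigned char *buff, int len);  /* from the answer above */

int main(void)
{
    size_t len = 30u * 1000 * 1000;            /* ~30 million bytes, as in the comments */
    unsigned char *buff = malloc(len);
    if (!buff)
        return 1;
    for (size_t i = 0; i < len; i++)
        buff[i] = (unsigned char)(i & 0x0F);   /* values 0..15 */

    clock_t start = clock();
    packarray(buff, (int)len);                 /* len fits in int on typical platforms */
    clock_t end = clock();

    printf("packed %zu bytes in %.3f ms\n", len,
           1000.0 * (double)(end - start) / CLOCKS_PER_SEC);
    free(buff);
    return 0;
}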

Edward
  • The loop could use one less math operation by using a pointer to the end of the packed buffer: `unsigned char *pend; ... pend = buff+(len/2); for(packed=buff; packed < pend; packed++){packed[0] = (buff[0] << 4) | buff[1]; buff += 2;}` – rcgldr May 03 '14 at 00:38
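Written out as a complete function, the pointer-to-end variant from the comment above might look like this (a sketch under the same in-place, even-length assumptions as the answer; names are illustrative and untested):

#include <stddef.h>

/* In-place packing, iterating with a pointer to the end of the packed region
 * instead of a decrementing length. */
void packarray_pend(unsigned char *buff, size_t len /* even, >= 2 */)
{
    unsigned char *packed;
    unsigned char *pend = buff + (len / 2);   /* one past the last packed byte */
    for (packed = buff; packed < pend; packed++) {
        packed[0] = (unsigned char)((buff[0] << 4) | buff[1]);
        buff += 2;
    }
}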