5

What would be the fastest way to pack two bytes into one? I have a large array of bytes. Each byte represents a number not larger than 15 (a 4-bit number). Because of that I could pack two bytes into one, placing the first byte into the higher nibble and the latter into the lower nibble.

My current approach is to create a second array half the size of the original, then iterate over the original array, shifting and ORing pairs of bytes to combine the nibbles. This works, however it takes a while depending on the size of the array. The arrays range from a few thousand entries to a few million. It's not catastrophic, but any optimization would be helpful.
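For reference, a minimal sketch of that approach (illustrative names; assumes the length is even and every source byte is <= 15):

#include <stddef.h>

/* src[2k] goes into the high nibble of dst[k], src[2k+1] into the low nibble. */
void pack_pairs(const unsigned char *src, unsigned char *dst, size_t n)
{
    for (size_t i = 0; i < n; i += 2)
        dst[i / 2] = (unsigned char)((src[i] << 4) | src[i + 1]);
}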

bames53
oipoistar
  • how big are the arrays? Because it _sounds_ like you're describing the only algorithm (except `|` instead of `&`) – Mooing Duck May 02 '14 at 19:10
  • Hopefully you realize that this will also double (or more) the time it takes to access a half-byte in the array. – zneak May 02 '14 at 19:11
  • @MooingDuck, fixed that. The arrays are from a few thousand entries to a few millions. It's not catastrophic but any optimization would be helpful. – oipoistar May 02 '14 at 19:14
  • 1
    On my machine, doing this with thirty million bytes takes less than 33ms. Is this really a performance problem you're encountering? http://coliru.stacked-crooked.com/a/a0eba0e907b5696e – Mooing Duck May 02 '14 at 19:33
  • 1
    One really efficient (0 cycles!) approach would be to do nothing; just keep the array unpacked. Is there some measurable advantage to packing the array? – Jeremy Friesner May 02 '14 at 19:47
  • 2
    @MooingDuck: you can trim about 10% by doing this: `unsigned lim = dest.size(); for(unsigned i=0, j=0; i < lim; ...)` – Edward May 02 '14 at 19:48
  • @Edward: That trims about 10% of the _theoretical_ operations, but I doubt it has any impact on an optimized build. Requires testing. – Mooing Duck May 02 '14 at 20:06
  • @MooingDuck: No, that's 10% as measured. Try it yourself: http://coliru.stacked-crooked.com/a/e0e63d2135c8f658 – Edward May 02 '14 at 20:09
  • @Edward: When switching to microseconds for more accurate measurements, and comparing only the fastest runs (31801us and 26086us) (http://coliru.stacked-crooked.com/a/7e61e6a75cccd8cf, http://coliru.stacked-crooked.com/a/0addb0dbce245766), yours is actually appears 18% faster! ...And seems entirely caused by the `dest.size()` in the condition: http://coliru.stacked-crooked.com/a/24a6af954939d14a (25897us) This one is ~1.16 destination bytes per _nanosecond_. – Mooing Duck May 02 '14 at 20:45
  • @MooingDuck: the C code I posted is another couple of percent faster. See http://coliru.stacked-crooked.com/a/caa2eed90de60c52 for the code with a complete test harness. – Edward May 02 '14 at 21:21
  • @Edward: Why C? : http://coliru.stacked-crooked.com/a/687e23e1b50c3448 – Mooing Duck May 02 '14 at 21:32
  • @MooingDuck: Two reasons: 1) I typically find it slightly more convenient to mix C and assembly vs. C++ and assembly and 2) on embedded platforms on which this kind of speed improvement might actually matter, it's more common to have C than C++ compilers. Given the choice, however, I usually prefer C++. – Edward May 02 '14 at 21:44

2 Answers

4

It will obviously take a while if your array is large - you need to go over all of it.

The first thing I'd do is create a lookup table from two bytes to one, so you don't need to shift and OR: take the next two bytes, use their combined value as an index into the table, and read off the resulting byte.

This lookup table only needs 2^12 entries (you only need 4 bits from the most significant byte), so it fits nicely in your CPU's L1 cache. It might be faster than shift-and-or.
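A rough sketch of that idea (illustrative names; assumes a little-endian CPU so the first byte of a pair is the low byte of the 16-bit index, and that every source byte is <= 15):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Since each byte is <= 15, bits 12-15 of the 16-bit index are always zero,
 * so 2^12 entries cover every possible pair. */
static uint8_t pack_lut[1 << 12];

static void init_pack_lut(void)
{
    for (unsigned first = 0; first < 16; ++first)         /* goes into the high nibble */
        for (unsigned second = 0; second < 16; ++second)  /* goes into the low nibble */
            pack_lut[first | (second << 8)] = (uint8_t)((first << 4) | second);
}

static void pack_with_lut(const uint8_t *src, uint8_t *dst, size_t n /* even */)
{
    for (size_t i = 0; i < n; i += 2) {
        uint16_t idx;
        memcpy(&idx, src + i, 2);   /* read a byte pair as one 16-bit index */
        dst[i / 2] = pack_lut[idx];
    }
}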

On the other hand, if you load 8 bytes at a time (on a 64-bit CPU, as they all are nowadays), you can turn them into 4 bytes and store those. You will also be able to parallelize this (divide the array into 4 parts, and have each core handle one part).

If there were an instruction that takes bytes 0, 2, 4 and 6 from a 64-bit register and puts them in a 32-bit register, you'd be done.
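Lacking such an instruction, here is a rough portable sketch of the 8-bytes-at-a-time idea using only shifts and masks (a hypothetical helper, assuming a little-endian CPU and that every source byte is <= 15):

#include <stdint.h>
#include <string.h>

/* Packs 8 unpacked source bytes into 4 packed bytes.
 * Source bytes 0,2,4,6 become the high nibbles, bytes 1,3,5,7 the low nibbles. */
static void pack8(const uint8_t *src, uint8_t *dst)
{
    uint64_t v;
    memcpy(&v, src, 8);                                   /* load 8 unpacked bytes */
    uint64_t hi = v & 0x000F000F000F000FULL;              /* src[0], src[2], src[4], src[6] */
    uint64_t lo = (v >> 8) & 0x000F000F000F000FULL;       /* src[1], src[3], src[5], src[7] */
    uint64_t packed16 = (hi << 4) | lo;                   /* one packed byte per 16-bit lane */
    /* squeeze the four packed bytes (at bits 0, 16, 32, 48) into one 32-bit word */
    uint32_t out = (uint32_t)( (packed16 & 0xFF)
                             | ((packed16 >> 8)  & 0xFF00)
                             | ((packed16 >> 16) & 0xFF0000)
                             | ((packed16 >> 24) & 0xFF000000) );
    memcpy(dst, &out, 4);                                 /* store 4 packed bytes */
}

Called in a loop over the array (with any remainder handled separately), this consumes eight source bytes per iteration and splits easily across cores.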

UPDATE: You mentioned in the question that you have a few million bytes. In that case, don't bother. The difference between highly optimized assembly and the naive implementation in C is not going to be worth the trouble. Just load the data two bytes at a time, shift and OR the two nibbles into one byte, and store it in the target array. Processing 1MB of data should be instant.

zmbq
  • 1
    loading a whole 2^12 bytes into RAM is probably slower than simply calculating them. – Mooing Duck May 02 '14 at 19:12
  • If you have a few gigabytes of data - no. If all you have is a few kilobytes, performance shouldn't be an issue at all. – zmbq May 02 '14 at 19:13
  • if you have gigabytes of data, then the data is likely to push the table out of the cache, giving _very_ slow access. – Mooing Duck May 02 '14 at 19:17
  • 1
    The L1 cache of an i7 is 32KB - the table will probably stay in the cache for the entire time. However, I'm sure there's a clever CPU command that can translate 8 nibbles into 4 bytes quickly. I just don't remember all the packing and unpacking opcodes. – zmbq May 02 '14 at 19:20
  • How did you get 2^12? Shouldn't a lookup to a byte have 2^8 => 256 entries? What's the lookup strategy, 2-d array? – Paul Rubel May 02 '14 at 19:28
  • 1
    When you load two nibbles into memory, you have bits 12-15 all zeroed. Bits 8-11 are the high nibble, bits 4-7 are zeroed out, and bits 0-3 are the low nibble. So, if you don't want to do any shifting at all because it's slow for some reason, you have 12 bits. – zmbq May 02 '14 at 19:30
0

I'd approach it first in C or C++, measure, and only resort to assembly if the performance is unacceptable. In C:

#include <assert.h>

/* Packs the array in place: the result ends up in the first len/2 bytes of buff. */
void packarray(unsigned char *buff, int len)
{
    unsigned char *packed;
    unsigned char byte;

    assert(len >= 2);         /* len must be at least 2 bytes */
    assert((len & 1) != 1);   /* len must be an even number */
    for (packed = buff; len > 0; len -= 2) {
        byte = *buff++;                      /* first byte becomes the high nibble */
        *packed++ = (byte << 4) | *buff++;   /* second byte becomes the low nibble */
    }
}

Warning: untested code
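As a hypothetical harness (not part of the answer) for the measure-first advice: fill a buffer with 4-bit values, pack it in place with the function above, and time it:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void packarray(unsigned char *buff, int len);  /* from the answer above */

int main(void)
{
    size_t len = 30u * 1000 * 1000;            /* ~30 million bytes, as in the comments */
    unsigned char *buff = malloc(len);
    if (!buff)
        return 1;
    for (size_t i = 0; i < len; i++)
        buff[i] = (unsigned char)(i & 0x0F);   /* values 0..15 */

    clock_t start = clock();
    packarray(buff, (int)len);                 /* len fits in int on typical platforms */
    clock_t end = clock();

    printf("packed %zu bytes in %.3f ms\n", len,
           1000.0 * (double)(end - start) / CLOCKS_PER_SEC);
    free(buff);
    return 0;
}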

Edward
  • The loop could use one less math operation by using a pointer to the end of the packed buffer: `unsigned char *pend; ... pend = buff+(len/2); for(packed=buff; packed < pend; packed++){packed[0] = (buff[0] << 4) | buff[1]; buff += 2;}` – rcgldr May 03 '14 at 00:38
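Written out as a complete function, the pointer-to-end variant from the comment above might look like this (a sketch under the same in-place, even-length assumptions as the answer; names are illustrative and untested):

#include <stddef.h>

/* In-place packing, iterating with a pointer to the end of the packed region
 * instead of a decrementing length. */
void packarray_pend(unsigned char *buff, size_t len /* even, >= 2 */)
{
    unsigned char *packed;
    unsigned char *pend = buff + (len / 2);   /* one past the last packed byte */
    for (packed = buff; packed < pend; packed++) {
        packed[0] = (unsigned char)((buff[0] << 4) | buff[1]);
        buff += 2;
    }
}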