
The bytes are unsigned and all less than 16, so each one fits into a nibble. I'm currently shifting the bytes in a loop and ANDing them with `0xf`:

pub fn compress(offsets: [u8; 8]) -> u32 {
    let mut co: u32 = 0;

    for (i, o) in offsets.iter().enumerate() {
        // Keep the low nibble of each byte and pack it 4 bits per index,
        // offsets[0] into the least significant nibble.
        co |= ((*o as u32) & 0xf) << (i * 4);
    }
    co
}
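
For reference, a minimal sanity check of what the loop produces, assuming the `compress` above is in scope:

fn main() {
    // offsets[0] ends up in the least significant nibble, offsets[7] in the most significant one.
    assert_eq!(compress([1, 2, 3, 4, 5, 6, 7, 8]), 0x8765_4321);
}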

The compiler already does a fairly good job optimizing this:

https://godbolt.org/z/NEpC64

But maybe it is possible to do some bit twiddling, or to use SIMD instructions on a u64, to reduce the number of operations?

  • the nibbles could be interleaved `y = (x & 0x0f0f0f0f0f0f0f0f); y |= (y >> 28);` – aqrit Dec 10 '19 at 21:30
  • Thanks, interesting. As far as I understand, the resulting byte order would be the following (big endian): 73625140 (see the sketch after these comments). Unfortunately I have to preserve the order of these bytes. – Philipp Mildenberger Dec 10 '19 at 22:27
  • this is basically converting from unpacked BCD to packed BCD. Related: [What is the most appropriate way to convert nibbles to a u64?](https://stackoverflow.com/q/30835978/995714), [Most efficient formula for unpacking 16-bit BCD? (e.g. 0x1234 to 0x01020304)](https://stackoverflow.com/q/59669635/995714) – phuclv Apr 05 '21 at 14:32
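
As a small illustration of the interleaving discussed in the comments above, here is a minimal sketch that uses nibble values equal to their byte index so the reordering is visible:

fn main() {
    // Bytes 0..=7 with value == index, so the result shows where each byte's nibble ends up.
    let x = u64::from_le_bytes([0, 1, 2, 3, 4, 5, 6, 7]);
    let y = x & 0x0f0f_0f0f_0f0f_0f0f;
    let y = y | (y >> 28);
    // The low 32 bits hold the nibbles interleaved rather than in the original order.
    assert_eq!(y as u32, 0x7362_5140);
}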

1 Answer


With the `bitintr` crate you can use `pext`:

bitintr::bmi2::pext(x, 0x0f0f0f0f0f0f0f0f)

However, that is only fast on Intel processors. AMD Ryzen implements BMI2, but its `pext` is very slow.
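
For illustration, a minimal sketch of the same idea using the standard library's `_pext_u64` intrinsic with a runtime feature check, assuming an x86_64 target; `compress_pext` is just an illustrative name, and the fallback is the portable version below:

#[cfg(target_arch = "x86_64")]
pub fn compress_pext(offsets: [u8; 8]) -> u32 {
    if is_x86_feature_detected!("bmi2") {
        // SAFETY: guarded by the runtime BMI2 check above.
        unsafe {
            core::arch::x86_64::_pext_u64(u64::from_le_bytes(offsets), 0x0f0f_0f0f_0f0f_0f0f) as u32
        }
    } else {
        // Portable fallback (e.g. the shift/mask version below).
        compress(offsets)
    }
}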

Here is an alternative using only ordinary code:

pub fn compress(offsets: [u8; 8]) -> u32 {
    let mut x = u64::from_le_bytes(offsets);
    // Each step merges adjacent nibble-holding fields into fields twice as wide.
    x = (x | (x >> 4)) & 0x00FF00FF00FF00FF;
    x = (x | (x >> 8)) & 0x0000FFFF0000FFFF;
    x = x | (x >> 16);
    x as u32
}

The steps do this:

start:         0x0a0b0c0d0e0f0g0h
x | (x >> 4):  0x0aabbccddeeffggh
& mask:        0x00ab00cd00ef00gh
x | (x >> 8):  0x00ababcdcdefefgh
& mask:        0x0000abcd0000efgh
x | (x >> 16): 0x0000abcdabcdefgh
as u32:                0xabcdefgh
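
A minimal equivalence check, assuming the question's loop version is renamed `compress_loop` and the version above `compress_bits` (both hypothetical names):

fn main() {
    for trial in 0..1_000u32 {
        // Derive arbitrary nibble values (all < 16) for the test.
        let offsets: [u8; 8] =
            core::array::from_fn(|i| ((trial.wrapping_mul(2654435761) >> (i * 4)) & 0xf) as u8);
        assert_eq!(compress_loop(offsets), compress_bits(offsets));
    }
}
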
harold
  • Your normal code is beautiful, and I love the illustration of the steps to see the bits flowing from one position to another. – Matthieu M. Dec 10 '19 at 08:28
  • Is either of these actually faster than what the OP provided? The question asks for the **fastest** solution, so I'd expect to see some benchmark that proves the implicit assertion that these are fast (or at least faster). – Shepmaster Dec 10 '19 at 14:53
  • @Shepmaster `pext` is a 3-cycle instruction on Intel, essentially unbeatable until a more specialized instruction shows up. But it's bad on AMD. That also disproves the existence of the fastest way to do it: the relative speeds of the various approaches depend on the hardware. So if we are to take the OP's title literally, there is no answer. – harold Dec 10 '19 at 15:15
  • Thanks, the solutions are at least a good amount shorter (in terms of the number of instructions). – Philipp Mildenberger Dec 10 '19 at 22:00
  • It would still be interesting to know whether AMD's `pext` implementation is faster than the architecture-independent version; is there more info on how it is implemented? Unfortunately I have no AMD processor to benchmark this. – Philipp Mildenberger Dec 10 '19 at 22:06
  • @PhilippMildenberger According to tests by [uops_info](https://twitter.com/uops_info/status/1202950247900684290), it seems to be a microcoded loop over the bits of the mask, generating 8 µops per set bit and costing around 4.5 extra cycles per set bit (there is some overhead too, so mask=0 is already slow, and fractional cycles can happen because it's a reciprocal throughput). I can't test it either, but by estimation setting 32 bits should take around 145-150 cycles. That's over ten times worse than your original version. – harold Dec 11 '19 at 06:58