3

I have a function, checkLeadingZeroBits, that checks whether the leading challengeSize bits of a SHA-256 hash are zero:


#include <array>
#include <cstdint>

typedef std::array<uint8_t, 32> SHA256Hash;

bool checkLeadingZeroBits(const SHA256Hash& hash, unsigned int challengeSize) {
    int bytes = challengeSize / 8;
    const uint8_t * a = hash.data();
    if (bytes == 0) {
        return a[0]>>(8-challengeSize) == 0;
    } else {
        for (int i = 0; i < bytes; i++) {
            if (a[i] != 0) return false;
        }
        int remainingBits = challengeSize - 8*bytes;
        // partial byte: the leftover bits live in a[bytes], the byte after the full ones
        if (remainingBits > 0) return a[bytes]>>(8-remainingBits) == 0;
        else return true;
    }
}

I've also tried the following, which has approximately the same run time:

    bool checkLeadingZeroBits(const SHA256Hash& hash, unsigned int challengeSize) {
        int bytes = challengeSize / 8;
        const uint8_t * a = hash.data();
        if (bytes == 0) {
            return a[0]>>(8-challengeSize) == 0;
        } else {
            // NULL_SHA256_HASH is an all-zero SHA256Hash; memcmp needs <cstring>
            if (memcmp(NULL_SHA256_HASH.data(), a, bytes) != 0) return false;
            int remainingBits = challengeSize - 8*bytes;
            // partial byte: the leftover bits live in a[bytes], the byte after the full ones
            if (remainingBits > 0) return a[bytes]>>(8-remainingBits) == 0;
            else return true;
        }
    }

I'd like to optimize this function to run as quickly as possible. Is there a simpler way to check the leading bits than the byte-by-byte loop shown in my implementation?

user491880
  • Are you targeting a particular compiler? Certain compilers provide certain intrinsics for bit counting like this. – Drew Dormann Dec 15 '21 at 23:19
  • Does this answer your question? [What is the fastest way to check the leading characters in a char array?](https://stackoverflow.com/questions/63390851/what-is-the-fastest-way-to-check-the-leading-characters-in-a-char-array) – Offtkp Dec 15 '21 at 23:22
  • @Offtkp: Sure, if you interpret `char` as eight bits. – Robert Harvey Dec 15 '21 at 23:23
  • The only relevant optimization I can think of is to reinterpret the memory at the machine word size rather than iterating a byte at a time. I think modern compilers might have already done this for you on this kind of code, although I'm not sure if it's simple enough to be "recognized" like that. – Karl Knechtel Dec 15 '21 at 23:24
  • @Offtkp it kind of gets there, but it doesn't handle the case where nBits%8 != 0 – user491880 Dec 15 '21 at 23:26
  • user491880, Curious, why start with an `unsigned` for `challengeSize` and then shift to `int` math with `bytes` and `remainingBits`? Usually `int` math and `unsigned` math perform the same, but if there is a difference, the edge tends to go to `unsigned`. – chux - Reinstate Monica Dec 16 '21 at 01:31
  • @chux : sloppiness is the main reason :) – user491880 Dec 16 '21 at 03:42
  • Is this somehow related to bitcoin? And what is the actual range of the `challengeSize`? For new blocks I think it's around 68 currently. And it's best to `memcpy` 64-bits to check for 0, which depending on the application is nonzero with a very high probability. – Aki Suihkonen Dec 16 '21 at 05:14

4 Answers

4

As written, no. A raw byte array can't be scanned at the bit level faster than a byte by byte comparison operation.

However, if you're targeting a platform with a count-leading-zeros instruction or equivalent, you can speed up the process by checking blocks the size of a native processor word (32-bit for x86/ARM, 64-bit for AMD64/AArch64). This would give you a slight improvement, but overall it's likely not significant enough to be useful in this instance.

To do this, you would need to get a raw pointer to the data, then cast it to a pointer to an integer type the same size as a machine word (uintptr_t*). From there, you could read out blocks of memory and pass them to the compiler intrinsic that counts the leading zeros of each word.

There are a few caveats to this. For one, you need to be very careful that your last read from the new pointer doesn't overstep into memory you don't own. Additionally, for x86 architectures prior to the introduction of lzcnt (before AMD ABM or Intel Haswell), you need to handle the case where the bsr (bit scan reverse, MSB to LSB) instruction finds no bits, which sets the zero flag instead of returning a real result (this will probably be invisible to you, or automatically transformed by the intrinsic, but I know MSVC's _BitScanReverse function returns true or false based on whether the instruction found a bit or not).
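
For illustration only, a minimal sketch of that word-at-a-time scan (assuming a little-endian target and GCC/Clang builtins; the function name is just a placeholder) might look like the following. It loads each 8-byte block with memcpy, so the reads stay inside the 32-byte array and alignment/aliasing aren't a concern:

    #include <algorithm>
    #include <array>
    #include <cstdint>
    #include <cstring>

    // Word-at-a-time scan: load 8 bytes, byte-swap so the hash's first byte
    // becomes the most significant, then count leading zeros of the word.
    bool leadingZerosWordwise(const std::array<uint8_t, 32>& hash, unsigned challengeSize) {
        unsigned checked = 0;
        for (unsigned off = 0; off < hash.size() && checked < challengeSize; off += 8) {
            uint64_t w;
            std::memcpy(&w, hash.data() + off, 8);              // unaligned-safe 8-byte load
            w = __builtin_bswap64(w);                           // hash bytes are stored big-endian
            unsigned need = std::min(challengeSize - checked, 64u); // bits still required from this word
            unsigned lz = (w == 0) ? 64u : (unsigned)__builtin_clzll(w); // clzll(0) is undefined
            if (lz < need) return false;                        // a set bit inside the required range
            checked += 64;
        }
        return true;
    }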

Ultimately, you're likely going to gain only a marginal improvement in performance, which will probably be counterbalanced by the memory accesses you'll need to do to read data out of the buffer. Overall, it's not really worth it - the compiler will do a much better job of optimizing the loops than you will with all the instructions in the world. Better to write code that's clear and easily maintained than code that's fragile, obtuse, and reliant on processor-specific instructions.

Which is a shame because I really like the lzcnt, bsr/bsf, and clz instructions. They can make scanning bitmaps significantly faster, but they aren't really useful here.

Alex
  • Isn't this, strictly speaking, Undefined Behavior? You're reading a `uintptr_t` where there isn't one. I think you'd need to `memcpy` and hope the compiler optimizes it away. – François Andrieux Dec 16 '21 at 03:13
  • Got it, seems like it's worth a try, but not guaranteed to give results. The zero-bit check is used to check hashes in a crypto miner, so it will be called billions of times -- every little piece counts, even a 10% difference is money in the bank. – user491880 Dec 16 '21 at 03:43
  • @FrançoisAndrieux Only by virtue of the pointer potentially lacking alignment and being oddly sized, which is why you need to be extremely careful. It's worth noting that at the assembly level, there's only integers and pointers, and if you're looking to use processor specific instructions you're basically one step up from that world. In fact, that's about the only way to make this code faster - hand code it in assembly. – Alex Dec 16 '21 at 22:30
  • @user491880 Then you've already lost there - you're competing against GPUs, which have these functions programmed in GPU specific assembly code (usually by compiling GLSL, Cg, or HLSL), or ASICs, which are dedicated FPGAs that have been built at the signal level to do this process. No amount of optimization at the CPU level will help you here. – Alex Dec 16 '21 at 22:34
  • "A raw byte array can't be scanned at the bit level faster than a byte by byte comparison operation." This is not true. In SSE 4.1 you can compare 16 bytes at a time with SIMD instructions. –  Dec 17 '21 at 02:18
  • Which, conceptually, is the same thing that I've mentioned, but using an opcode set entirely specific to x86/x64. In addition, you still have the issue of memory loading when using SSE, which will likely still cause your code to slow down, and SSE generally requires that memory addresses be 16 byte aligned, though SSE 4 seems to have relaxed that restriction. – Alex Dec 17 '21 at 19:52
  • @Alex just handle the unaligned bytes at the beginning and end manually. That's how glibc and other C libraries do it for operations like memcpy, strlen... Auto-vectorization by compilers also does that: https://godbolt.org/z/vdKhn6Thz – phuclv Dec 18 '21 at 05:37
  • @phuclv I'm well aware of how C libraries typically do memcpy/strlen - that's what I drew on in my answer. Those unaligned bytes are exactly why you need to be careful about how you implement a word aligned comparison - if you have a platform with strict alignment concerns (like ARM), your pointer (allocated by the STL, in the OP) may not match the requirements of the platform to actually use this solution right out of the gate. Also worth noting those operations typically are built in assembly as they tend to be performance critical. – Alex Dec 25 '21 at 04:21
2

One of the issues here is that the byte array is effectively in big-endian format. This can be mitigated on some architectures by a byte swap, and on others by reading 64 bits anyway but concentrating on the LSB bits instead.

 //     MSB
 // x = 00 00 00 00 08 FE BB AB <-- as stored byte wise in memory
 // y = AB BB FE 08 00 00 00 00 <-- as read from memory in x86

 #include <cstdint>
 #include <cstring>

 bool check(uint8_t const *x, int challenge_count) {   // challenge_count < 64 assumed
     uint64_t y;
     std::memcpy(&y, x, 8);
     // The leading whole bytes of the hash sit in the low bytes of y on x86,
     // so test the low (challenge_count & ~7) bits.
     if (y & ~(~0ull << (challenge_count & ~7))) return false;
     // check the remaining bits (less than a byte) in the next hash byte
     int rem = challenge_count & 7;
     return rem == 0 || (x[challenge_count / 8] >> (8 - rem)) == 0;
 }

Here, if you need to perform the same comparison many times, it makes sense to cache the mask ~(~0ull << (challenge_count & ~7)). Also, the handling of challenge_count >= 64 is omitted here; it can be implemented by a memcpy of each preceding 8-byte block followed by a comparison to zero.
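
That >= 64 handling might look roughly like this (a sketch reusing check() and the headers above; check_long is just an illustrative name):

 // Require each leading 8-byte block to be all zero, then delegate the
 // final < 64 bits to check() above.
 bool check_long(uint8_t const *x, int challenge_count) {
     while (challenge_count >= 64) {
         uint64_t y;
         std::memcpy(&y, x, 8);
         if (y != 0) return false;
         x += 8;
         challenge_count -= 64;
     }
     // the short-circuit avoids an 8-byte read past the hash when the count hits 0
     return challenge_count == 0 || check(x, challenge_count);
 }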

Aki Suihkonen
0

There are SSE4.2 instructions that can be used to mass-compare bytes and speed up this algorithm. You can compare 16 bytes at a time with one instruction.

https://github.com/WojciechMula/sse4-strstr/blob/master/sse4.2-strstr.cpp
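
A minimal sketch of that 16-bytes-at-once comparison against zero (SSE2 is already enough for this, as the comment below points out; the function name is illustrative):

    #include <immintrin.h>
    #include <cstdint>

    // pcmpeqb marks the bytes equal to zero; movemask collapses that to one bit per byte.
    bool first16_all_zero(const uint8_t *p)
    {
        __m128i v   = _mm_loadu_si128((const __m128i*)p);
        __m128i is0 = _mm_cmpeq_epi8(v, _mm_setzero_si128());
        return _mm_movemask_epi8(is0) == 0xFFFF;   // every one of the 16 bytes compared equal to zero
    }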

  • For checking against zero, SSE2 `pcmpeqb` (`_mm_cmpeq_epi8`) is fine. With `_mm_movemask_epi8` / `tzcnt`, you can find the position of the first non-zero byte, and bit-scan it. SSE4.2 instructions aren't particularly helpful (slower), but yes SIMD in general is. Especially AVX2 to check all 32 bytes at once. **Or even better, SSE4.1 `ptest` with the right mask to check that the low `n` bits are all zero.** – Peter Cordes Jun 08 '22 at 21:14
0

For a loop-invariant n, x86 SSE4.1 ptest can check this in one fairly cheap instruction. Create a mask with bits set in all the positions you require to be zero, and zeros elsewhere. Do this once when n changes. Assuming a use-case like bitcoin, that's rare.

#include <immintrin.h>   // SSE4.1 intrinsics
#include <array>
#include <cstdint>

bool no_bits_set(const std::array<uint8_t, 32> &arr, __m128i mask)
{
  __m128i v = _mm_loadu_si128((const __m128i*)arr.data());  // first 16 bytes of the hash
  return  _mm_testz_si128(mask, v);   // 1 iff (mask & v) == 0, i.e. every masked bit is zero
}
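
The mask described above could be built once per n roughly like this (n <= 128 so a single 128-bit load covers it; make_mask is an illustrative name, and the byte fill matches the hash's big-endian layout):

// 0xFF for each whole leading byte that must be zero, then the top (n % 8)
// bits of the following byte; everything else stays zero.
__m128i make_mask(unsigned n)            // n <= 128
{
    alignas(16) uint8_t m[16] = {0};
    unsigned full = n / 8, rem = n % 8;
    for (unsigned i = 0; i < full; ++i) m[i] = 0xFF;
    if (rem) m[full] = (uint8_t)(0xFFu << (8 - rem));
    return _mm_load_si128((const __m128i*)m);
}

With that, no_bits_set(hash, make_mask(n)) checks that the leading n bits are zero.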

With AVX1, this can test for any n up to 256 with the right mask, or with just SSE4.1 any n up to 128. (Interestingly, _mm256_testz_si256 only requires AVX1, not AVX2)
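
A 256-bit version along those lines could be (AVX1 only; same headers as above, the name is illustrative):

bool no_bits_set_256(const std::array<uint8_t, 32> &arr, __m256i mask)
{
    __m256i v = _mm256_loadu_si256((const __m256i*)arr.data());  // all 32 bytes of the hash
    return _mm256_testz_si256(mask, v);   // 1 iff (mask & v) == 0
}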

This is equivalent to Aki's answer, but scalar uint64_t is limited to n = 64 where you'd be testing if (y & -1ULL) to check all 64 low bits.

If there's anything odd about bit-order or byte-order, that's something you can sort out when making the mask; you don't need to shuffle the data.


If you want to know the position of the lowest set bit, _mm_cmpeq_epi8 against _mm_setzero_si128() / _mm_movemask_epi8 gives you a bitmask with one bit per byte, set where the byte is zero; invert it and scan with __builtin_ctz to find the first non-zero byte. (If your bytes aren't in little-endian order, shuffle them before movemask.) That gives you the position of the lowest non-zero byte; load it and bit-scan it to find the bit.
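
A rough sketch of that search (reusing the headers from the snippets above; __builtin_ctz is the GCC/Clang builtin, and first_nonzero_byte is just an illustrative name):

// Returns the index of the first non-zero byte in the 32-byte hash, or 32 if all bytes are zero.
int first_nonzero_byte(const std::array<uint8_t, 32> &arr)
{
    for (int off = 0; off < 32; off += 16) {
        __m128i v    = _mm_loadu_si128((const __m128i*)(arr.data() + off));
        __m128i is0  = _mm_cmpeq_epi8(v, _mm_setzero_si128());       // 0xFF where byte == 0
        unsigned nz  = ~(unsigned)_mm_movemask_epi8(is0) & 0xFFFF;   // 1-bits mark non-zero bytes
        if (nz) return off + __builtin_ctz(nz);                      // lowest set bit = first non-zero byte
    }
    return 32;
}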

Peter Cordes