
I reached a bottleneck in my code, so the main issue of this question is performance.

I have a hexadecimal checksum and I want to check the leading zeros of an array of chars. This is what I am doing:

bool starts_with (char* cksum_hex, int n_zero) {
  bool flag {true};
  for (int i=0; i<n_zero; ++i)
    flag &= (cksum_hex[i]=='0');
  return flag;
}

The above function returns true if the cksum_hex has n_zero leading zeros. However, for my application, this function is very expensive (60% of total time). In other words, it is the bottleneck of my code. So I need to improve it.

I also checked std::string::starts_with, which is available in C++20, and observed no difference in performance:

// I have to convert cksum to string
std::string cksum_hex_s (cksum_hex);
cksum_hex_s.starts_with("000");     // checking for 3 leading zeros

For reference, I am compiling with g++ -O3 -std=c++2a, and my gcc version is 9.3.1.

Questions

  • What is the fastest way of checking the leading characters in a char array?
  • Is there a more efficient way of doing it with std::string::starts_with?
  • Do bitwise operations help here?
Peter Cordes
Ali
  • Is it intended that your starts_with method does not check for the terminating character on cksum? Or do you have external knowledge that cksum_hex has more than n_zero characters? – Guillaume Gris Aug 13 '20 at 08:18
  • Is `n_zero` a number you know at compile time? – Guillaume Gris Aug 13 '20 at 08:19
  • 2
    @GuillaumeGris I am sure that n_zero never exceeds the range of cksum_hex. We do not know n_zero at compile time. – Ali Aug 13 '20 at 08:19
  • 8
    That's a pretty trivial function, it's hard to imagine a program where that can take 60% of the time, unless `n_zero` is often very high. Is the function inlined? I presume not, because inlined functions don't usually show up on profilers. Inlining might be the biggest win. Also, there are two types of profiler, instrumenting profilers and sampling profilers. Instrumenting profilers often add an overhead per function call which can inflate the importance of frequently called functions. – Columbo Aug 13 '20 at 08:19
  • 4
    May be a stupid question, but why are you using the "&=" operator? Couldn't it be easier and with less checks if you return false as soon as you spot a nonzero number? – User.cpp Aug 13 '20 at 08:25
  • `for (int i=0; i<n_zero; ++i) { if (cksum_hex[i] != '0') return false; }` – User.cpp Aug 13 '20 at 08:28
  • 4
    For those who wonder why the bitwise operation might be a good idea, it allows the compiler to do some vectorization as it can "know" that `n_zero`. characters will be read. Imagine `n_zero == 4` then all this function is replaced by "load an unaligned int and check it's value is zero" instead of four comparisons. The compiler can't do that with the logical and – Guillaume Gris Aug 13 '20 at 08:28
  • how long is `cksum_hex`? Is the range of `n_zero` limited? – phuclv Aug 13 '20 at 08:44
  • Another thing to consider is just trying different optimization levels/settings. Sometimes a higher level does not yield optimum performance for a given use case. – Hulk Aug 13 '20 at 09:02
  • 1
    @Ali What's the upper bound for n_zero? Is it different for each string or do you test a whole range of strings with the same value? I'm wondering if you might benefit from branching once at the start of the dataset then calling a version of starts_with where n_zero is templated – Tharwen Aug 13 '20 at 09:34
  • Possibly relevant question: https://stackoverflow.com/q/63392584/580083 – Daniel Langr Aug 13 '20 at 10:14
  • 3
    Can you use x86 SIMD, like SSE2? – Peter Cordes Aug 13 '20 at 11:06
  • 2
    If you are doing so much processing on this value, why are you storing it in a human-friendly, machine-unfriendly format like ASCII hex? – David Schwartz Aug 13 '20 at 17:55
  • @DavidSchwartz Thanks for the feedback. I don't have much knowledge about machine-friendly formats. Can you please elaborate on this point? – Ali Aug 13 '20 at 18:01
  • @PeterCordes Yes, I can. However, I do not know how to apply SIMD in this function. I appreciate it if you give me a hint. – Ali Aug 13 '20 at 18:04
  • 1
    @Ali Just don't use ASCII. – David Schwartz Aug 13 '20 at 18:23
  • 1
    @Ali: I mentioned that possibility at the bottom of my answer. It would only be a good idea if you were reading hex digits as ASCII text and didn't otherwise need to convert them to a binary integer at all. If you can just use a 64-bit integer, shifting or masking it is even better. – Peter Cordes Aug 13 '20 at 18:42
  • If there were a much faster way of checking this than `string::starts_with`, wouldn't `string::starts_with` just use that instead? It's a good rule of thumb to just assume the answer is "yes" for most functions in well-known libraries (assuming it's not much more general than what you're trying to use it for). If this is the bottleneck of your code, and that's a problem, you likely need to rethink how you represent data and when you call this (to avoid calling this as often or at all) rather than trying to make this call as fast as possible. Or maybe you only need to check 1 or 2 positions. – Bernhard Barker Aug 14 '20 at 06:15
  • If you represented your checksum as an `int *`, it would divide the number of bytes to process by 2 (as 2 hex characters represent 1 byte) and you would also benefit from optimisations like checking if an int = 0 or if it's n*4 first bits are 0. I think this is what David suggests. – Didier L Aug 14 '20 at 11:53

7 Answers


If you modify your function to return early

bool starts_with (char* cksum_hex, int n_zero) {
  for (int i=0; i<n_zero; ++i)
  {
    if (cksum_hex[i] != '0') return false;
  }
  return true;
}

it will be faster when n_zero is large and the result is false. Otherwise, you can try allocating a global array of '0' characters and using std::memcmp:

// make it as big as you need
constexpr char cmp_array[4] = {'0', '0', '0', '0'};
bool starts_with (char* cksum_hex, int n_zero) {
    return std::memcmp(cksum_hex, cmp_array, n_zero) == 0;
}

The problem here is that you need to assume some max possible value of n_zero.
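One way to make that assumption explicit is to build the comparison array at compile time and assert the bound. This is an illustrative sketch, not part of the original answer; the `MAX_ZEROS` constant and `make_zeros` helper are hypothetical names:

```cpp
#include <array>
#include <cassert>
#include <cstring>

constexpr int MAX_ZEROS = 64;  // hypothetical upper bound on n_zero

// Build an array of MAX_ZEROS '0' characters at compile time.
constexpr std::array<char, MAX_ZEROS> make_zeros() {
    std::array<char, MAX_ZEROS> a{};
    for (auto& c : a) c = '0';
    return a;
}
constexpr auto cmp_array = make_zeros();

bool starts_with(const char* cksum_hex, int n_zero) {
    assert(n_zero >= 0 && n_zero <= MAX_ZEROS);  // enforce the assumed bound
    return std::memcmp(cksum_hex, cmp_array.data(), n_zero) == 0;
}
```

This way growing the assumed maximum is a one-line change, and debug builds catch any violation of the assumption.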


=== EDIT ===

Considering the complaints about the lack of profiling data to justify the suggested approaches, here you go:

Data used:

const char* cs1 = "00000hsfhjshjshgj";
const char* cs2 = "20000hsfhjshjshgj";
const char* cs3 = "0000000000hsfhjshjshgj";
const char* cs4 = "0000100000hsfhjshjshgj";

memcmp is fastest in all cases except cs2, where the early-return implementation wins.

Early return vs memcmp
memcmp vs orig

pptaszni
  • 3
    I don't believe both those implementatinos to be faster than the op implementation. Do you have any profiling data that shows this to be an improvement? – Guillaume Gris Aug 13 '20 at 08:34
  • Of course I don't have any profiling data, I just give the OP some ideas to try, then maybe something will work better for him. Early return might improve the performance for data like `0x1000000...`. For `memcmp` honestly no idea, because it depends on its implementation (see produced assembly in the examples). – pptaszni Aug 13 '20 at 08:40
  • @GuillaumeGris now I have – pptaszni Aug 13 '20 at 09:07
  • Wow, really interesting data, I did not expect memcmp to be faster here considering what the compiler can optimize. Thanks for actually doing the profiling! – Guillaume Gris Aug 13 '20 at 09:13
  • 3
    One thing to consider is that quick-bench doesn't seem to compile for recent architectures. Comparing the assembly on godbolt for the memcmp vs the original, -march=broadwell (for example) results in quite a few vector instructions for the memcmp version – Tharwen Aug 13 '20 at 09:14
  • 1
    @GuillaumeGris Compare these implementations made specifically for comparing 8 bytes: https://godbolt.org/z/E1WGn1. For some reason, the loop-based variants are **not** translated into a single (unaligned) load + cmp instruction. Only the `memcmp` version is. This holds for all GCC, Clang, and Intel. I have no idea why, it seems to be a pretty straightforward optimization. I will eventually ask another question about it. – Daniel Langr Aug 13 '20 at 09:42
  • 3
    I have bad news, I'm afraid: https://quick-bench.com/q/yEjbk9vg92UvVfkD4R4qyK-XmpU memcmp appears to be slower when n_zero isn't known at compile time – Tharwen Aug 13 '20 at 09:42
  • @Tharwen, that's true indeed, however still a little bit faster than `flag &=` inside the loop https://quick-bench.com/q/c5LT64Sk3IjtKDsOuTHiPkNJv80; It seems that `memcmp` is more efficient, for bigger data; As usual the answer to the OP problem is a bit like "it depends" :). – pptaszni Aug 13 '20 at 09:57
  • I was so confused by this result that I decided to try some profiling on my side. I can't reproduce the QuickBench results on my computer. My profiling shows that the OP's implementation and memcmp are equivalent. Either it is hardware dependent or there is something subtle I missed – Guillaume Gris Aug 13 '20 at 10:14
  • This is likely to be hardware dependent. For memcmp to be faster it usually needs the CPU in question to support a better comparison operation than char == char. Are chars stored as 1 byte or native? Can the hardware compare 64 bits at once, or even larger blocks? If the registers are big enough, an AND on two 1 kB long items could be 1 operation rather than 1000. The hope is that memcmp and the compiler can make that decision based on the target hardware, to make things portable and more efficient than a byte-by-byte comparison. – TafT Aug 14 '20 at 13:19

Presumably you also have the binary checksum? Instead of converting it to ASCII text first, look at the 4*n high bits to check n nibbles directly for 0 rather than checking n bytes for equality to '0'.

e.g. if you have the hash (or the high 8 bytes of it) as a uint64_t or unsigned __int128, right-shift it to keep only the high n nibbles.

I showed some examples of how they compile for x86-64 when both inputs are runtime variables, but these also compile nicely to other ISAs like AArch64. This code is all portable ISO C++.


bool high_zero_nibbles (uint64_t cksum_high8, int n_zero)
{
    int shift = 64 - n_zero * 4;       // A hex digit represents a 4-bit nibble
    return (cksum_high8 >> shift) == 0;
}

Clang does a nice job for x86-64 with -O3 -march=haswell to enable BMI1/BMI2:

high_zero_nibbles(unsigned long, int):
        shl     esi, 2
        neg     sil                  # x86 shifts wrap the count so 64 - c is the same as -c
        shrx    rax, rdi, rsi        # BMI2 variable-count shifts save some uops.
        test    rax, rax
        sete    al
        ret

This even works for n=16 (shift=0) to test all 64 bits. It fails for n_zero = 0 to test none of the bits; it would encounter UB by shifting a uint64_t by a shift count >= its width. (On ISAs like x86 that wrap out-of-bounds shift counts, code-gen that worked for other shift counts would result in checking all 16 nibbles. As long as the UB wasn't visible at compile time...) Hopefully you're not planning to call this with n_zero=0 anyway.
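If n_zero = 0 (or out-of-range values) can reach the function, one hedge is to guard the shift explicitly. This variant is a small sketch added for illustration, not part of the original answer:

```cpp
#include <cstdint>

bool high_zero_nibbles_safe(uint64_t cksum_high8, int n_zero) {
    if (n_zero <= 0) return true;               // zero leading zeros: trivially true
    if (n_zero >= 16) return cksum_high8 == 0;  // only 16 nibbles fit in 64 bits
    int shift = 64 - n_zero * 4;                // now safely in [4, 60]
    return (cksum_high8 >> shift) == 0;
}
```

The two extra compares are cheap and well predicted if out-of-range values are rare.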

Other options: create a mask that keeps only the high n*4 bits, perhaps shortening the critical path through cksum_high8 if that's ready later than n_zero. Especially if n_zero is a compile-time constant after inlining, this can be as fast as checking cksum_high8 == 0. (e.g. x86-64 test reg, immediate.)

bool high_zero_nibbles_v2 (uint64_t cksum_high8, int n_zero) {
    int shift = 64 - n_zero * 4;         // A hex digit represents a 4-bit nibble
    uint64_t low4n_mask = (1ULL << shift) - 1;
    return (cksum_high8 & ~low4n_mask) == 0;   // true when the high 4*n bits are all zero
}

Or use a bit-scan function to count leading zero bits and compare for >= 4*n. Unfortunately it took ISO C++ until C++20 <bit>'s countl_zero to finally portably expose this common CPU feature that's been around for decades (e.g. 386 bsf / bsr); before that only as compiler extensions like GNU C __builtin_clz.

This is great if you want to know how many and don't have one specific cutoff threshold.

bool high_zero_nibbles_lzcnt (uint64_t cksum_high8, int n_zero) {
    // UB on cksum_high8 == 0.  Use x86-64 BMI1 _lzcnt_u64 to avoid that, guaranteeing 64 on input=0
    return __builtin_clzll(cksum_high8) >= 4*n_zero;
}

#include <bit>
bool high_zero_nibbles_stdlzcnt (uint64_t cksum_high8, int n_zero) {
    return std::countl_zero(cksum_high8) >= 4*n_zero;
}

These compile to (clang for Haswell):

high_zero_nibbles_lzcnt(unsigned long, int):
        lzcnt   rax, rdi
        shl     esi, 2
        cmp     esi, eax
        setle   al                    # FLAGS -> boolean integer return value
        ret

All these instructions are cheap on Intel and AMD, and there's even some instruction-level parallelism between lzcnt and shl.

See asm output for all 4 of these on the Godbolt compiler explorer. Clang compiles 1 and 2 to identical asm. Same for both lzcnt ways with -march=haswell. Otherwise it needs to go out of its way to handle the bsr corner case for input=0, for the C++20 version where that's not UB.


To extend these to wider hashes, you can check the high uint64_t for being all-zero, then proceed to the next uint64_t chunk.
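A sketch of that chunked extension, assuming the wider hash is stored as an array of uint64_t in big-endian chunk order (chunks[0] holding the most significant 16 nibbles; the layout and the function name are assumptions for this example):

```cpp
#include <cstdint>

// Check for n_zero leading zero hex digits across big-endian-ordered 64-bit
// chunks.  Assumes n_zero <= 16 * n_chunks.
bool starts_with_wide(const uint64_t* chunks, int n_chunks, int n_zero) {
    for (int i = 0; i < n_chunks && n_zero > 0; ++i) {
        if (n_zero >= 16) {               // this entire chunk must be zero
            if (chunks[i] != 0) return false;
            n_zero -= 16;
        } else {                          // partial chunk: test only the high nibbles
            int shift = 64 - n_zero * 4;
            return (chunks[i] >> shift) == 0;
        }
    }
    return true;
}
```

The all-zero full-chunk tests stay branch-predictable, and only the final partial chunk needs a variable shift.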


Using an SSE2 compare with pcmpeqb on the string, pmovmskb -> bsf could find the position of the first 1 bit, thus how many leading-'0' characters there were in the string representation, if you have that to start with. So x86 SIMD can do this very efficiently, and you can use that from C++ via intrinsics.
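A rough sketch of that SIMD idea via intrinsics, guarded so it falls back to scalar code off x86. The 16-readable-bytes precondition and the function name are assumptions of this sketch, not something the original answer spelled out:

```cpp
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Count leading '0' characters in the first 16 bytes of s.
// Precondition (an assumption of this sketch): s has at least 16 readable bytes.
int leading_zero_chars(const char* s) {
#if defined(__SSE2__)
    __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s));
    __m128i zeros = _mm_set1_epi8('0');
    // Bytes equal to '0' become 0xFF; pmovmskb packs one bit per byte.
    unsigned mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zeros));
    unsigned not_zero = ~mask & 0xFFFF;          // bits where byte != '0'
    return not_zero ? __builtin_ctz(not_zero) : 16;
#else
    int i = 0;
    while (i < 16 && s[i] == '0') ++i;           // portable scalar fallback
    return i;
#endif
}
```

`starts_with(s, n_zero)` then becomes `leading_zero_chars(s) >= n_zero` for n_zero up to 16, with a single 16-byte compare instead of a byte-at-a-time loop.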

Peter Cordes
  • Note that GCC makes a mess of some of these, wasting lots of instructions. Also, without `-march=haswell` or another `lzcnt`-enabling option, the `countl_zero` option has to branch on the input being all-zero. (And even with `-march=haswell`, GCC has some missed optimizations, still special-casing input=0 even though `lzcnt` does exactly the right thing.) – Peter Cordes Aug 14 '20 at 00:08

You can make a buffer of '0' characters large enough for your needs, then compare against it with memcmp:

const char *zeroBuffer = "000000000000000000000000000000000000000000000000000";

if (memcmp(zeroBuffer, cksum_hex, n_zero) == 0) {
   // ...
}
I S

Things you want to check to make your application faster:

1. Can the compiler inline this function in places where it is called?

Either declare the function inline in a header or put the definition in the translation unit where it is used.

2. Not computing something is faster than computing something more efficiently

Are all calls to this function necessary? High cost is generally the sign of a function called inside a high-frequency loop or in an expensive algorithm. You can often reduce the call count, and hence the time spent in the function, by optimizing the outer algorithm.

3. Is n_zero small or, even better, a constant?

Compilers are pretty good at optimizing algorithms for typically small constant values. If the constant is known to the compiler, it will most likely remove the loop entirely.
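If the caller can branch once on the value, the templated variant Tharwen suggested in the comments might look like this sketch (the function names and the dispatched values are illustrative):

```cpp
template <int NZero>
bool starts_with_fixed(const char* s) {
    bool flag = true;
    for (int i = 0; i < NZero; ++i)  // trip count known at compile time,
        flag &= (s[i] == '0');       // so the loop can be fully unrolled/vectorized
    return flag;
}

// Hypothetical dispatch for a few common values of n_zero.
bool starts_with(const char* s, int n_zero) {
    switch (n_zero) {
        case 3:  return starts_with_fixed<3>(s);
        case 4:  return starts_with_fixed<4>(s);
        case 8:  return starts_with_fixed<8>(s);
        default: {                    // generic fallback for other values
            bool flag = true;
            for (int i = 0; i < n_zero; ++i) flag &= (s[i] == '0');
            return flag;
        }
    }
}
```

The switch is paid once per call, but each instantiation gives the optimizer a compile-time constant trip count.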

4. Does the bitwise operation help here?

It definitely has an effect and allows Clang (but not GCC, as far as I can tell) to do some vectorization. Vectorization tends to be faster, but that's not always the case, depending on your hardware and the actual data processed. Whether it is an optimization or not might depend on how big n_zero is. Considering you are processing checksums, it should be pretty small, so it sounds like a potential optimization. For a known n_zero, using the bitwise operation allows the compiler to remove all branching. I expect this to be faster, though I did not measure it.

std::all_of and std::string::starts_with should compile to essentially the same code as your implementation, except that they use && instead of &.

Guillaume Gris

Adding my two cents to this interesting discussion, though a little late to the game: you could use std::equal. It's a fast method with a slightly different approach, using a hardcoded string with the maximum number of zeros instead of the number of zeros.

It works by passing the function pointers to the begin and end of the string to be searched, and to the string of zeros, with end pointing one past the wanted number of zeros; these are used as iterators by std::equal:

Sample

bool startsWith(const char* str, const char* end, const char* substr, const char* subend) {
    return  std::equal(str, end, substr, subend);
}
int main() {

    const char* str = "000x1234567";
    const char* substr = "0000000000000000000000000000";
    std::cout << startsWith(&str[0], &str[3], &substr[0], &substr[3]); 
}

Using the test cases in @pptaszni's good answer and the same testing conditions:

const char* cs1 = "00000hsfhjshjshgj";
const char* cs2 = "20000hsfhjshjshgj";
const char* cs3 = "0000000000hsfhjshjshgj";
const char* cs4 = "0000100000hsfhjshjshgj";

The results were as follows:

[quick-bench results chart]

Slower than memcmp, but still faster than your original code (except for false results with a low number of zeros) and more consistent.

anastaciu
  • @CodyGray, yes, I had some tests which were wrong and gave me around 1.0; I edited that out but forgot to change the "faster" remark. Still, these are faster than the OP's, especially when the number of zeros starts to grow. – anastaciu Aug 14 '20 at 07:47
  • 1
    Note that your benchmarks are all for compile-time-constant lengths, letting clang generate to code like `movzwl (%rax),%edx` / `xor (%rcx),%dx` and finally an `or` to check 3 bytes for being different. (I had expected to find it was just assigning a compile-time-constant bool to `result` since you only used `benchmark::DoNotOptimize(result);`, not on the inputs. But turns out is is using 2 pairs of load/xor each for 3-byte vs. 10-byte, so that explains the constant time for different strings. It would probably be worse with a 9-byte compare if it's not smart about overlapping.) – Peter Cordes Aug 19 '20 at 04:28
  • 1
    But anyway, this use of a couple fixed widths of compares to reach the requested width only works when the width is known at compile time. Otherwise it might just call `memcmp`. – Peter Cordes Aug 19 '20 at 04:28
  • @PeterCordes, thanks for your great analysis, in any case, without DoNotOptimize this will be basically the same as using memcmp, with optimization. – anastaciu Aug 19 '20 at 08:22

Unless n_zero is quite high I would agree with others that you may be misinterpreting profiler results. But anyway:

  • Could the data be swapped to disk? If your system is under RAM pressure, data could be swapped out to disk and need to be loaded back to RAM when you perform the first operation on it. (Assuming this checksum check is the first access to the data in a while.)

  • Chances are you could use multiple threads/processes to take advantage of a multicore processor.

  • Maybe you could use statistics/correlation of your input data, or other structural features of your problem.

    • For instance, if you have a large number of digits (e.g. 50) and you know that the later digits have a higher probability of being nonzero, you can check the last one first.
    • If nearly all of your checksums should match, you can use [[likely]] to give a compiler hint that this is the case. (Probably won't make a difference but worth a try.)
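A minimal sketch of the attribute idea (C++20; [[unlikely]] marks the mismatch branch as rare, which is purely an optimizer hint and assumes that nearly all checksums match):

```cpp
#include <cstddef>

bool starts_with(const char* cksum_hex, int n_zero) {
    for (int i = 0; i < n_zero; ++i) {
        if (cksum_hex[i] != '0') [[unlikely]] {  // mismatches assumed rare
            return false;
        }
    }
    return true;
}
```

The semantics are identical to the plain early-return version; only the compiler's branch layout may change.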
Artelius

Use std::all_of

return std::all_of(cksum_hex, cksum_hex + n_zero, [](char c){ return c == '0'; });
phuclv
  • 1
    This is elegant and really a good use of stl algorithm but does not seem to answer OP questions. Why would this be faster? Is the bitwise comparison he does of any use? – Guillaume Gris Aug 13 '20 at 08:44
  • @GuillaumeGris if the range of `n_zero` is unknown, there's no way to know what the fastest way is. Just use `std::all_of` and let the compiler do the vectorized comparison – phuclv Aug 13 '20 at 11:08