
I was reading sehe's answer to this question and was surprised to see that sehe found a hand-written loop using std::memchr to be over 3 times faster than using std::count (see comments). The code using std::count can be seen in edit 2, but it basically boils down to:

const auto num_lines = std::count(f, l, '\n');

vs

uintmax_t num_lines = 0;
while (f && f != l)
    if ((f = static_cast<const char*>(memchr(f, '\n', l - f))))
        num_lines++, f++;

I would have expected the std::count version to be at least as fast as the std::memchr one - for a similar reason to why using std::copy should be at least as fast as std::memcpy.

I checked my standard library's (libc++) implementation of std::count, and there is no attempt to optimise for char input types (ditto for std::find).

Why is this? Could the implementations not dispatch to std::memchr when given char* iterators and a char value?
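Concretely, I'd imagine something like the following hypothetical overload set (sketch code written for this question, not libc++'s actual implementation; `my_count` is a made-up name):

```c++
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <iterator>

// Generic fallback: whatever element-by-element loop the library already has.
template <class It, class T>
typename std::iterator_traits<It>::difference_type
my_count(It first, It last, const T& value) {
    return std::count(first, last, value);
}

// Overload selected for contiguous char ranges: one memchr call per match.
inline std::ptrdiff_t my_count(const char* first, const char* last, char value) {
    std::ptrdiff_t matches = 0;
    while (first != last) {
        const void* hit = std::memchr(first, value,
                                      static_cast<std::size_t>(last - first));
        if (!hit)
            break;
        ++matches;
        first = static_cast<const char*>(hit) + 1;  // resume just past the match
    }
    return matches;
}
```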

    It's open source. Submit a patch :-). – rici Apr 19 '17 at 00:24
  • @rici, haha ... however, in reality submitting a patch is often the easiest part. Getting it applied is the real challenge. For example, I've submitted a libstdc++ [patch which specializes](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88545) `std::find()` for simple random access cases to just call `__builtin_memchr()`. This was 2 years ago - and just submitting it didn't spark any interest. – maxschlepzig Jan 09 '21 at 15:50

1 Answer


Using an actual function call to memchr is only a win if the average distance between matches is not small.

Especially for count, calling memchr could be a lot slower if you were counting 't' characters when they appear on average every 2 or maybe every 4 bytes (e.g. with DNA base pairs using an alphabet of ACGT): the fixed per-call overhead is paid once per match.

I'd be skeptical of using a memchr loop as the default implementation for std::count over char arrays. There are likely better ways to tweak the source so it compiles to better asm - for example, the branchless loop sketched below.
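A sketch of that kind of tweak (my illustration, not libc++'s actual source): a branchless accumulating loop does the same work per byte whether matches are dense or sparse, and hands the compiler a body it can vectorize:

```c++
#include <cstdint>

// Hypothetical tweak: accumulate the 0/1 result of the compare instead
// of branching per element, so match density doesn't affect throughput.
uint64_t count_byte(const char* f, const char* l, char c) {
    uint64_t n = 0;
    for (; f != l; ++f)
        n += (*f == c);  // bool converts to 0 or 1; no per-element branch
    return n;
}
```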

For find it would make more sense, even though it potentially increases the overhead significantly vs. a simple byte-at-a-time loop if there's a hit in the first couple of bytes.
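A sketch of that forwarding (hypothetical code; the libstdc++ patch linked in the question comments does something similar via `__builtin_memchr`):

```c++
#include <cstddef>
#include <cstring>

// find-style overload for contiguous char ranges. std::find's contract
// is to return `last` when there's no match, while memchr returns NULL.
inline const char* find_char(const char* first, const char* last, char value) {
    if (first == last)
        return last;
    const void* hit = std::memchr(first, value,
                                  static_cast<std::size_t>(last - first));
    return hit ? static_cast<const char*>(hit) : last;
}
```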


You could also look at this as a compiler missed-optimization. If compilers made better code for the loops in std::count and std::find, there'd be less to gain from calling hand-written asm library functions.

gcc and clang never auto-vectorize loops when the trip count isn't known before entering the loop (i.e. they don't vectorize search loops, which is a major missed optimization for element sizes as small as bytes). ICC doesn't have this limitation and can vectorize search loops. I haven't looked at how it does with libc++'s std::count or find, though.
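For example, a search loop like this has a data-dependent trip count because of the early exit, so gcc and clang leave it scalar:

```c++
// Trip count unknown before entry: the loop may exit at any byte.
const char* first_newline(const char* f, const char* l) {
    while (f != l && *f != '\n')
        ++f;
    return f;  // == l if no newline was found
}
```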

std::count has to check every element, so it should auto-vectorize. But if gcc or clang don't even at -O3, then that's unfortunate. It should vectorize very well on x86 with pcmpeqb (packed compare bytes), and then paddb the 0/-1 compare results (with psadbw against zero at least every 255 iterations to horizontally sum the byte elements before they can overflow).
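A rough sketch of that strategy with SSE2 intrinsics (my sketch, not libc++'s code; it uses psubb on the 0/-1 results, which is equivalent to paddb and negating at the end, and `_mm_cvtsi128_si64` assumes x86-64):

```c++
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2

uint64_t count_byte_sse2(const char* p, std::size_t n, char c) {
    const __m128i needle = _mm_set1_epi8(c);
    uint64_t count = 0;
    std::size_t i = 0;
    while (i + 16 <= n) {
        std::size_t vecs = (n - i) / 16;
        if (vecs > 255) vecs = 255;  // keep byte counters from wrapping
        __m128i acc = _mm_setzero_si128();
        for (std::size_t j = 0; j < vecs; ++j, i += 16) {
            __m128i v  = _mm_loadu_si128(
                reinterpret_cast<const __m128i*>(p + i));
            __m128i eq = _mm_cmpeq_epi8(v, needle);  // 0xFF where equal
            acc = _mm_sub_epi8(acc, eq);             // acc -= -1, i.e. acc += 1
        }
        // psadbw against zero horizontally sums the byte counters into
        // two 64-bit halves before they can overflow.
        __m128i sums = _mm_sad_epu8(acc, _mm_setzero_si128());
        count += static_cast<uint64_t>(_mm_cvtsi128_si64(sums));
        count += static_cast<uint64_t>(
            _mm_cvtsi128_si64(_mm_unpackhi_epi64(sums, sums)));
    }
    for (; i < n; ++i)  // scalar tail, fewer than 16 bytes left
        count += (p[i] == c);
    return count;
}
```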

Library function call overhead is at least an indirect call with a function pointer from memory (which can cache miss). On Linux with dynamic linking there's usually an extra jmp through the PLT as well (unless you compiled with -fno-plt). memchr is easier to optimize with low startup overhead than strchr, because you can quickly check whether a 16B vector load can go past the end (vs. aligning the pointer for strchr or strlen to avoid crossing a page or cache-line boundary).
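To illustrate the low-overhead startup (a hypothetical sketch, not any real libc's memchr):

```c++
#include <cstddef>
#include <emmintrin.h>  // SSE2

// Because n is explicit, one comparison proves a 16-byte unaligned load
// stays inside the buffer, so no up-front pointer alignment is needed.
// strchr/strlen have no length, so they must align first to avoid
// faulting on a page crossing.
const char* my_memchr(const char* p, char c, std::size_t n) {
    const __m128i needle = _mm_set1_epi8(c);
    std::size_t i = 0;
    while (n - i >= 16) {  // a whole 16B vector load fits
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, needle));
        if (mask)
            return p + i + __builtin_ctz(mask);  // gcc/clang builtin: first match
        i += 16;
    }
    for (; i < n; ++i)  // scalar tail, fewer than 16 bytes left
        if (p[i] == c)
            return p + i;
    return nullptr;
}
```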

If calling memchr is the best way to implement something in asm, then in theory that's what the compiler should emit. gcc/clang already optimize large copy loops to calls to libc memcpy, depending on target options (-march=), e.g. when the copy is large enough that the libc version may decide to use NT stores on x86.
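For example, gcc recognizes this pattern via -ftree-loop-distribute-patterns (enabled at -O3) and replaces the loop with a library call (memcpy, or memmove if it can't rule out overlap):

```c++
#include <cstddef>

// A plain byte-copy loop the compiler can turn into a memcpy call.
void copy_bytes(char* dst, const char* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}
```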

Peter Cordes
  • Your comment about cache-line crossing when optimizing `strchr` (as opposed to simply page crossing) is interesting. Do you mean that implementations do/should try to avoid such crossing because it would unnecessarily touch an adjacent cache line, which isn't a correctness problem (unless I'm missing something) but may be a performance problem, especially if it causes false sharing? I don't see that much difference between `memchr` and `strchr` though: both have to make that same decision about cache line crossing, and even though `memchr` might know ... – BeeOnRope Nov 15 '17 at 19:23
  • ... that the logical memory region might be at least 16 bytes, if it chooses to do a 16B load based on that fact it might still read into the next cache line unnecessarily when the character is found before that. I haven't checked modern implementations, but doing a 16B or 32B load right off the bat seems very advantageous in real world situations for either function since it greatly quantizes the branching behavior: return values of <= 16 will all have the same behavior and be well predicted. Short byte-wise "intro" loops have bad behavior unless the length is always exactly the same. – BeeOnRope Nov 15 '17 at 19:23
  • @BeeOnRope: hmm, good point. That part of my answer wasn't thoroughly thought out. With `memchr`, touching a 2nd possibly-not-needed cache line is still part of the same object. Unlike `strlen`, the calling code for memchr might well also load from memory after the match. I'm not sure about optimal `strlen` startup strategy. I guess aligned vector load that includes the start of the string is good, then pcmpeqb / pmovmskb / not / bsr and compare against `p&-16` (the offset of the string start relative to 16B alignment.) – Peter Cordes Nov 15 '17 at 19:39
  • Huh, based on my reading of the C11 standard, you can't actually make the optimization for `memchr` that you suggest[1]: even if the user passes in an `n` > 16, it seems that maybe you cannot read 16 bytes if it were to cross a page since the element could be found before (i.e., the user can pass a too-long length if the function will never access the out-of-bounds region because it found the element). I'm not sure though, so I asked [here](https://stackoverflow.com/q/47315902/149138). [1] I'm assuming you are suggesting that it's safe to read the entire region implied by the user-passed `n`. – BeeOnRope Nov 15 '17 at 19:42
  • Yeah, I agree there is a hierarchy of badness when it comes to accessing memory beyond the found element. From least to most bad: (A) accessing only up to the found element (B) accessing beyond the found element but not into the next cache line (C) accessing beyond the found element and into the next cache line, but not outside the logical size of the whole object (i.e., the `n`-sized region for `memchr` or the strlen for strings) (D) accessing beyond the logical size of the object but staying within the same page and (E) accessing beyond the logical size of the object into the next page. – BeeOnRope Nov 15 '17 at 19:51
  • In practice (E) is not going to be correct unless the implementation has some special knowledge, so we can mostly eliminate it. (A) and (B) seem tied to me: accessing beyond the found element but within a cache line seems harmless (not exactly: I guess there are store-forwarding concerns...). So it's mostly A/B, C and D. Unlike `memchr`, `strchr` has a hard time implementing C since the end of the object is also implicit, so it ends up doing D instead. However both can do B with a bit of work. – BeeOnRope Nov 15 '17 at 19:54
  • @BeeOnRope: (B) being worse than (A) only because of possible store-forwarding issues, like with memchr finding a match in the first 8 bytes that was just written with a qword store, store-forwarding stall if you load it with a 16B load? (Same problem if you do an aligned vector load that includes data from before the object / before the last store). Yup, I see your followup comment already covered that. – Peter Cordes Nov 15 '17 at 19:55
  • Yeah, exactly. Now this store forwarding thing is fairly unlikely and has a limited impact (much smaller than say a miss to DRAM or false sharing), so I'd say you'd mostly want to aim for B (or further down like C or D) over A since A completely eliminates anything other than byte-by-byte reads. ISTM that if you were ok with D a good approach for either function would be to first check if you were close to the end of the page, and if not do `M` unaligned loads interleaved with checks (i.e., a small unrolled intro) and then switch over to a cache-line aligned loop. – BeeOnRope Nov 15 '17 at 20:02
  • @BeeOnRope: You don't like the idea of an aligned vector load that includes the start of the buffer? You can BSF it to ignore matches from before the start of the bytes you're supposed to be looking at. That avoids even crossing a cache-line. Not a big deal, but it avoids any branching. Hmm, I guess it's not as easy if the vector load includes data from before *and* after the region you're supposed to be looking at. I guess a variable-count shift of the `pmovmskb` result could help... But maybe branching is easier. – Peter Cordes Nov 15 '17 at 20:05
  • I like the idea too, but if you ignore the "problem" with accessing into the next cache line, it seems strictly worse, because you (a) examine fewer bytes in the first load which is bad by itself (less work done) and it also implies (b) worse quantization of the branching behavior (c) slightly more work to mask out/check that you didn't match before the buffer. It's (b) that can bite you the most: if most of your matches are quite short it's really good to capture all matches < 16 in the first load & branch. – BeeOnRope Nov 15 '17 at 20:12
  • If you do an aligned load it will only capture matches somewhere _up to_ the first 1 to 16 bytes depending on alignment, so more cases will need a second load and be less predictable. It also breaks the rule that if the strings are all the _same_ length, branching should be perfectly predictable (even the byte-by-byte approach gives you that) - now same-length matches will have different branching behavior depending on alignment. – BeeOnRope Nov 15 '17 at 20:12
  • Of course, you can't ignore the cost of going into the next cache line, so maybe B is better after all. It's hard to compare them in a general way (certainly for specific applications one may be obviously better than the other). Also you avoid the need for the special case near the end of the page if you do aligned loads only. – BeeOnRope Nov 15 '17 at 20:13
  • About the way to handle "out of range" elements when you do a load that includes them, there are a lot of different ways. You can try to mask out the invalid elements in the vector domain (before the `pmovmsk`) or in the integer domain (after), by loading a mask from memory or by generating the mask (perhaps by a variable shift). You can try to shift the characters or the `cmp` result in the vector domain or shift the result of the `pmovmsk` in the integer domain. Finally you can try to `cmov` the result of the `bsf` to correct an out-of-range result. Did I miss any? – BeeOnRope Nov 15 '17 at 20:32
  • The `cmov` approach is appealing for the last iteration, to check that you didn't find a match in the trailing elements beyond the `n` or the `\0`, but for the startup load it's less clear. Now, I'm not sure I understood your suggestion above: _then pcmpeqb / pmovmskb / not / bsr and compare against `p&-16`_. You are talking about the case where the load is aligned and out of range elements occur before the start of the region, right? I don't see how `bsr` works here: it's always going the wrong way (finding later elements), I think? – BeeOnRope Nov 15 '17 at 20:37
  • Maybe what you mean is do a backwards scan in the first load (starting in the valid region, hopefully - although not always) and then if it finds a match before the region, you know there is no match in the region and continue. The problem is this pessimizes the common important case where there _is_ a match in the region covered by the first load: you have to find it, and `bsr` didn't tell you the answer, so you end up doing one of the other approaches. Probably better to start with one of the other approaches to begin with. – BeeOnRope Nov 15 '17 at 20:40
  • @BeeOnRope: oh right, that's what I said but it's not as useful as I thought. So you actually want a variable-count right shift to get rid of bits *before* the region of interest, and `bsf` it. And then check that it wasn't beyond the region of interest. Maybe branching before the actual `not` / `bsf` based on the region being all non-matches. – Peter Cordes Nov 15 '17 at 20:44
  • Yes, that's one valid approach. It works well with the BMI shift instructions which suck less than the `, cl` ones. In benchmarks loading a mask from memory probably wins (it's like one out-of-line uop to `and` with a memory source, plus you need to adjust the pointer if you get a hit), but yeah benchmarks vs cache misses, etc. That mask can be re-used for many similar methods that all have the same concern though... BTW, all these different tradeoffs (and much more, like good startup vs long string throughput) are why it seems like JIT compilation should be able to overtake static one day... – BeeOnRope Nov 15 '17 at 20:47