strstr faster than algorithms?

Question

I have a file that's 21056 bytes.

I've written a program in C that reads the entire file into a buffer, and then uses multiple search algorithms to search the file for a token that's 82 chars.

I've used all the implementations of the algorithms from the “Exact String Matching Algorithms” page. I've used: KMP, BM, TBM, and Horspool. And then I used strstr and benchmarked each one.

What I'm wondering is, each time the strstr outperforms all the other algorithms. The only one that is faster sometimes is BM.

Shouldn't strstr be the slowest?

Here's my benchmark code with an example of benchmarking BM:

double get_time()
{
    LARGE_INTEGER t, f;
    QueryPerformanceCounter(&t);
    QueryPerformanceFrequency(&f);
    return (double)t.QuadPart/(double)f.QuadPart;
}

before = get_time();
BM(token, strlen(token), buffer, len);
after = get_time();
printf("Time: %f\n\n", after - before);

Could someone explain to me why strstr is outperforming the other search algorithms? I'll post more code on request if needed.

The error is in your code. At least Horspool should consistently outperform `strstr` – KMP may actually be slower. But since you didn’t post your code, we can’t help you. That said, you can chose your data deliberately to make the naive search win so the choice of input data is also relevant. — Konrad Rudolph, Sep 28 '11 at 17:15
Are you seriously benchmarking searching for an 80 byte string in a 20k string?... Your sample size is so small you could do it by hand! — Blindy, Sep 28 '11 at 17:18

TonyK · Accepted Answer · 2011-09-29T11:07:04.873

32

Why do you think strstr should be slower than all the others? Do you know what algorithm strstr uses? I think it's quite likely that strstr uses a fine-tuned, processor-specific, assembly-coded algorithm of the KMP type or better. In which case you don't stand a chance of out-performing it in C for such small benchmarks.

(The reason I think this is likely is that programmers love to implement such things.)

edited Sep 29 '11 at 11:07

answered Sep 28 '11 at 17:20

TonyK

16,761
4
37
72

It’s almost guaranteed that `strstr` uses the naive algorithm. It performs well enough, and, this is the important part, it’s *general purpose*. KMP, for instance, sometimes performs *much* worse so it certainly isn’t used. – Konrad Rudolph Sep 28 '11 at 17:22
Huh, always thought `strstr` used some sort of brute force. – Josh Sep 28 '11 at 17:33
2

@Josh It does, don’t be misled. – Konrad Rudolph Sep 28 '11 at 17:34
IIRC Glibc has optimised asm functions for popular platforms and functions, and (more or less) naive fall-back implementations in C. – wildplasser Sep 28 '11 at 17:53
1

@Konrad: Look at Banthar's link. It supports my claim. And what's this about KMP sometimes performing much worse than the naive algorithm? I find that impossible to believe. Can you provide a reference? – TonyK Sep 28 '11 at 17:59
4

@Josh: It doesn't, don't be misled! – TonyK Sep 28 '11 at 18:00
1

TonyK: I suggest editing your answer to include @Banthar's link. Makes it obvious that you are right (well, it looks more like BM than KMP, but still). – Nemo Sep 28 '11 at 20:55
1

@TonyK KMP has good theoretical characteristics in terms of the number of comparisons but in practice it’s slower than the naive search since its operations are more complex. This is well known but if you need a reference, consider the standard work on this matter, [Flexible Pattern Matching, page 1](http://bit.ly/ov239o) by Navarro and Raffinot (the link is buggy, sorry – try pressing “Previous” and “Next” in the search results to get it to display correctly). Incidentally, this also shows that BM is in practice slower than Horspool. – Konrad Rudolph Sep 29 '11 at 06:44
2

@Banthar Colour me surprised. I *had* looked at multiple `strstr` implementations (I’ve worked extensively on string search implementations over the last few years) and every implementation I saw was the naive search – which is a *good choice*, by the way (see above). I’m also surprised that they used BM instead of Horspool for long needles since the latter is known to perform better on average in typical situations (again, see above). – Konrad Rudolph Sep 29 '11 at 06:51
TonyK, I agree with your answer, but for it to be more futureproof, it should not mention what kind of technique strstr() is implemented. Just something "likely to be very fast on the platform of choice". – Prof. Falken Sep 29 '11 at 10:52
@Amigable: OK, I changed it slightly. – TonyK Sep 29 '11 at 11:08
And that tipped the scale. +1 I like this answer very much. :) – Prof. Falken Sep 29 '11 at 11:10
@Konrad: When you said "*much* worse", I thought you meant "worse than `O(mn)`". I agree that naive search is likely to be faster than KMP for small strings. But then a good implementation will use a naive algorithm for such strings anyway. – TonyK Sep 29 '11 at 11:11
@TonyK It will be faster for small *needles*, regardless of the length of the haystack – this is important, because it’s the most common use-case by far: your text may be very large, but as long as the needle is relatively small, naive search will be faster than KMP. – Konrad Rudolph Sep 29 '11 at 11:22
1

@Konrad: You're still wrong, it's not BM -- as it says on line 1 it uses the Two-Way Algorithm, which amazingly takes only *linear time and constant space*, unlike BM which requires space proportional to the needle size, or BMH which is not linear-time on some inputs: http://igm.univ-mlv.fr/~lecroq/string/node26.html – j_random_hacker Sep 29 '11 at 11:46
@Konrad: When line 50 mentions the "bad character shift table", that should tell you that the sentence really means it's even more like BMH than BM (since both use that table, but BMH uses *only* that table). – j_random_hacker Sep 29 '11 at 11:52
@j_random_hacker Ah, so it does. I had only skimmed the code itself and surmised that two-way would only be used for short needles. This is indeed a sensible choice although I still think Horspool could be faster in many realistic scenarios (and Navarro & Raffinot also seem to think this). But you are wrong about the constant space: two-way *does* need space linear in the size of the needle, otherwise its worst case is quadratic, not linear! And the glibc implementation also uses linear space for large needles. – Konrad Rudolph Sep 29 '11 at 11:56
1

@Konrad: Where do you get the idea that it needs quadratic time from? Also, the glibc implementation needs O(alphabet_size) space for large needles, not space linear in the size of the needle. – j_random_hacker Sep 29 '11 at 12:09
@j_random_hacker I’ll shut up now. – Konrad Rudolph Sep 29 '11 at 12:20
4

@Konrad: Ah, I see the algorithm description was worded a bit confusingly on the page I linked. But it is indeed a weakness-free algorithm: constant-time and linear-space. That it exists should be better known and appreciated! :) What's more, it's right where it should be: in the library implementation of `strstr()`. :) – j_random_hacker Sep 29 '11 at 12:32
Whoops, I meant "linear-time and constant-space"! – j_random_hacker Oct 03 '11 at 05:30

score 19 · Answer 2 · edited Oct 03 '15 at 15:26

Horspool, KMP et al are optimal at minimizing the number of byte-comparisons.

However, that's not the bottleneck on a modern processor. On an x86/64 processor, your string is being loaded into L1 cache in cache-line-width chunks (typically 64 bytes). No matter how clever your algorithm is, unless it gives you strides that are larger than that, you gain nothing; and the more complicated Horspool code is (at least one table lookup) can't compete.

Furthermore, you're stuck with the "C" string constraint of null-termination: SOMEWHERE the code has to examine every byte.

strstr() is expected to be optimal for a wide range of cases; e.g. searching for tiny strings like "\r\n" in a short string, as well as much longer ones where some smarter algorithm might have a hope. The basic strchr/memcmp loop is pretty hard to beat across the whole range of likely inputs.

Pretty much all x86-compatible processors since 2003 have supported SSE2. If you disassembled strlen()/x86 for glibc, you may have noticed that it uses some SSE2 PCMPEQ and MOVMASK operations to search for the null terminator 16 bytes at a time. The solution is so efficient that it beats the obvious super-simple loop, for anything longer than the empty string.

I took that idea and came up with a strstr() that beats glibc's strstr() for all cases greater than 1 byte --- where the relative difference is pretty much moot. If you're interested, check out:

Convergence SSE2 and strstr()
A better strstr() with no ASM code

If you care to see a non-SSE2 solution that dominates strstr() for target strings of more than 15 bytes, check out:

which makes use of multibyte comparisons rather than strchr(), to find a point at which to do a memcmp.

BTW you've probably figured out by now that the x86 REP SCASB/REP CMPSB ops fall on their ass for anything longer than 32 bytes, and are not much improvement for shorter strings. Wish Intel had devoted a little more attention to that, than to adding SSE4.2 "string" ops.

For strings large enough to matter, my perf tests show that BNDM is better than Horspool, across the board. BNDM is more tolerant of "pathological" cases, such as targets that heavily repeat the last byte of a pattern. BNDM can also make use of SSE2 (128bit registers) in a way that competes with 32bit registers in efficiency and start-up cost. Source code here.

Have you tried submitting a patch to the glibc? Of course, this requires careful benchmark over a wide range of different inputs but your above explanation seems sound. — Konrad Rudolph, Oct 29 '11 at 10:30
so given the cache optimization, is it a waste of time to try to outperform the naive search with Horspool? — mikew, Mar 05 '13 at 03:40
@Mischa, Any advise how to compile this code for windows? I have no idea what is ffs and where to get it. — user626528, Apr 05 '13 at 12:01
Wow, so sorry I never answered that. For the record, the MSVC equivalent is _BitScanForward(), but it's a bit clunky. I wrap it as (pardon the code in comments please, O Editor) `static inline uint32_t intlowz(uint32_t x) { uint32_t pos; return x ? (_BitScanForward(&pos, x), pos) : 32; }` — Mischa, May 27 '17 at 23:00

score 3 · Answer 3 · answered Sep 28 '11 at 17:21

3

Without seeing your code, it's hard to say exactly. strstr is heavily optimized, and usually written in assembly language. It does things like reading data 4 bytes at a time and comparing them (bit-twiddling if necessary if the alignment isn't right) to minimize memory latency. It can also take advantage of things like SSE to load 16 bytes at a time. If your code is only loading one byte at a time, it's probably getting killed by memory latency.

Use your debugger and step through the disassembly of strstr -- you'll probably find some interesting things in there.

answered Sep 28 '11 at 17:21

Adam Rosenfield

390,455
97
512
589

1

The effects of 4-byte-comparisons and SSE on string search are quite limited, unfortunately since there exist linear dependencies between the elements. In particular, you can’t just compare in non-overlapping 4-byte blocks: even if you compare 4 bytes at a time, you still need to compare overlapping blocks, not gaining much. – Konrad Rudolph Sep 29 '11 at 06:56
@Konrad: `strstr` can basically be implemented as a loop around `strchr` and `strcmp`/`memcmp` (with `memcmp` requiring an initial `strlen`). `strchr` can be implemented with some bit-twiddling tricks and doing 4-byte loads. `memcmp` can load 4 bytes at a time from each of the two comparands and compare those words. If the two operands don't have the same alignment, you keep track of two words at a time and bit-twiddle them to get the misaligned word to compare against the other string. See `sysdeps/x86_64/memcmp.S` in glibc, for example. – Adam Rosenfield Sep 29 '11 at 17:35
@Konrad: you gain something comparing the first 4 bytes of a pattern against a 4byte sliding window, shifting and inserting the next target string byte into the bottom of a 32bit word. It's not a much as you'd wish, since mixing 8 and 32 bit instructions on the same register stalls the instruction pipeline. SSE's a different story: you can compare 16 bytes from the target string with the first byte of the pattern, replicated 16 times. This seriously rocks. See my post below re an article with more details. – Mischa Oct 28 '11 at 18:48

score 2 · Answer 4 · answered Sep 28 '11 at 17:19

Imagine you want to get something cleaned. You could just clean it yourself, or you could hire ten professional cleaners to clean it. If the cleaning job is an office building, the latter solution would be preferable. If the cleaning job was one window, the former would be preferable.

You never get any payback for the time spent setting up to do the job efficiently because the job doesn't take very long.

strstr faster than algorithms?

4 Answers4

Linked