203

OK, so I don't sound like an idiot, I'm going to state the problem/requirements more explicitly:

  • Needle (pattern) and haystack (text to search) are both C-style null-terminated strings. No length information is provided; if needed, it must be computed.
  • Function should return a pointer to the first match, or NULL if no match is found.
  • Failure cases are not allowed. This means any algorithm with non-constant (or large constant) storage requirements will need to have a fallback case for allocation failure (and performance in the fallback case thereby contributes to worst-case performance).
  • Implementation is to be in C, although a good description of the algorithm (or link to such) without code is fine too.

...as well as what I mean by "fastest":

  • Deterministic O(n) where n = haystack length. (But it may be possible to use ideas from algorithms which are normally O(nm) (for example rolling hash) if they're combined with a more robust algorithm to give deterministic O(n) results).
  • Never performs (measurably; a couple clocks for if (!needle[1]) etc. are okay) worse than the naive brute force algorithm, especially on very short needles which are likely the most common case. (Unconditional heavy preprocessing overhead is bad, as is trying to improve the linear coefficient for pathological needles at the expense of likely needles.)
  • Given an arbitrary needle and haystack, comparable or better performance (no worse than 50% longer search time) versus any other widely-implemented algorithm.
  • Aside from these conditions, I'm leaving the definition of "fastest" open-ended. A good answer should explain why you consider the approach you're suggesting "fastest".

My current implementation runs in roughly between 10% slower and 8 times faster (depending on the input) than glibc's implementation of Two-Way.

Update: My current optimal algorithm is as follows:

  • For needles of length 1, use strchr.
  • For needles of length 2-4, use machine words to compare 2-4 bytes at once as follows: Preload the needle in a 16- or 32-bit integer with bitshifts and cycle the old byte out/new byte in from the haystack at each iteration (a sketch of this appears after this list). Every byte of the haystack is read exactly once and incurs a check against 0 (end of string) and one 16- or 32-bit comparison.
  • For needles of length >4, use Two-Way algorithm with a bad shift table (like Boyer-Moore) which is applied only to the last byte of the window. To avoid the overhead of initializing a 1kb table, which would be a net loss for many moderate-length needles, I keep a bit array (32 bytes) marking which entries in the shift table are initialized. Bits that are unset correspond to byte values which never appear in the needle, for which a full-needle-length shift is possible.
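
In rough sketch form, the 2-byte case looks something like this (simplified; the 3- and 4-byte cases are analogous with a 32-bit accumulator plus a mask). It assumes the caller has already checked that the first haystack byte and both needle bytes are nonzero, so reading h[1] is safe:

#include <stdint.h>

static char *twobyte_search(const unsigned char *h, const unsigned char *n) {
  uint16_t nw = n[0] << 8 | n[1];                 /* needle preloaded into one word */
  uint16_t hw = h[0] << 8 | h[1];                 /* current two-byte window of the haystack */
  for (h++; *h && hw != nw; hw = hw << 8 | *++h)  /* cycle the old byte out, the new byte in */
    ;
  return hw == nw ? (char *)(h - 1) : 0;
}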

The big questions left in my mind are:

  • Is there a way to make better use of the bad shift table? Boyer-Moore makes best use of it by scanning backwards (right-to-left) but Two-Way requires a left-to-right scan.
  • The only two viable candidate algorithms I've found for the general case (no out-of-memory or quadratic performance conditions) are Two-Way and String Matching on Ordered Alphabets. But are there easily-detectable cases where different algorithms would be optimal? Certainly many of the algorithms that use O(m) space (where m is needle length) could be used for m<100 or so. It would also be possible to use algorithms which are worst-case quadratic if there's an easy test for needles which provably require only linear time.

Bonus points for:

  • Can you improve performance by assuming the needle and haystack are both well-formed UTF-8? (With characters of varying byte lengths, well-formed-ness imposes some string alignment requirements between the needle and haystack and allows automatic 2-4 byte shifts when a mismatching head byte is encountered. But do these constraints buy you much/anything beyond what maximal suffix computations, good suffix shifts, etc. already give you with various algorithms?)

Note: I'm well aware of most of the algorithms out there, just not how well they perform in practice. Here's a good reference so people don't keep giving me references on algorithms as comments/answers: http://www-igm.univ-mlv.fr/~lecroq/string/index.html

R.. GitHub STOP HELPING ICE
  • 208,859
  • 35
  • 376
  • 711
  • There are quite a number of string search algorithms listed on [Algorithms on Strings](http://en.wikipedia.org/wiki/Category:Algorithms_on_strings). You may want to describe which algorithms you've considered from this list. – Greg Hewgill Jul 06 '10 at 05:05
  • 71
    That link at the end is gold! – Carlos Nov 30 '10 at 02:19
  • 5
    I can't believe you still haven't accepted an answer. – user541686 Mar 05 '15 at 05:33
  • 2
    @Mehrdad: I was about to say there aren't any answers which really address the question as asked, but yours seems to. At the time you answered I'd moved on and left further improvement of `strstr` as something for later, so I haven't actually gotten around to properly reading the paper you linked, but it does sound very promising. Thanks and sorry for not getting back to you. – R.. GitHub STOP HELPING ICE Mar 05 '15 at 18:07

17 Answers

45

Build up a test library of likely needles and haystacks. Profile the tests on several search algorithms, including brute force. Pick the one that performs best with your data.

Boyer-Moore uses a bad character table with a good suffix table.

Boyer-Moore-Horspool uses a bad character table.

Knuth-Morris-Pratt uses a partial match table.

Rabin-Karp uses running hashes.

They all trade overhead for reduced comparisons to a different degree, so the real world performance will depend on the average lengths of both the needle and haystack. The more initial overhead, the better with longer inputs. With very short needles, brute force may win.
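
To make one of these trade-offs concrete, here is a minimal Boyer-Moore-Horspool sketch (an illustration only, not taken from any particular library); the only preprocessing is filling the 256-entry bad-character table, and that is exactly the overhead that has to be amortized over the search:

#include <limits.h>
#include <stddef.h>
#include <string.h>

char *bmh_search(const char *haystack, const char *needle) {
  size_t m = strlen(needle), n = strlen(haystack);
  if (m == 0) return (char *)haystack;
  if (m > n) return NULL;

  /* Bad-character table: how far the window may shift, keyed on its last byte. */
  size_t shift[UCHAR_MAX + 1];
  for (size_t i = 0; i <= UCHAR_MAX; i++) shift[i] = m;        /* byte not in needle: full shift */
  for (size_t i = 0; i + 1 < m; i++)
    shift[(unsigned char)needle[i]] = m - 1 - i;               /* rightmost occurrence wins */

  for (size_t pos = 0; pos + m <= n; ) {
    if (memcmp(haystack + pos, needle, m) == 0)
      return (char *)(haystack + pos);
    pos += shift[(unsigned char)haystack[pos + m - 1]];        /* always shifts by at least 1 */
  }
  return NULL;
}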

Edit:

A different algorithm might be best for finding base pairs, English phrases, or single words. If there were one best algorithm for all inputs, it would have been publicized.

Think about the following little table. Each question mark might have a different best search algorithm.

                 short needle     long needle
short haystack         ?               ?
long haystack          ?               ?

This should really be a graph, with a range of shorter to longer inputs on each axis. If you plotted each algorithm on such a graph, each would have a different signature. Some algorithms suffer with a lot of repetition in the pattern, which might affect uses like searching for genes. Some other factors that affect overall performance are searching for the same pattern more than once and searching for different patterns at the same time.

If I needed a sample set, I think I would scrape a site like Google or Wikipedia, then strip the HTML from all the result pages. For a search site, type in a word, then use one of the suggested search phrases. Choose a few different languages, if applicable. Using web pages, all the texts would be short to medium, so merge enough pages to get longer texts. You can also find public domain books, legal records, and other large bodies of text. Or just generate random content by picking words from a dictionary. But the point of profiling is to test against the type of content you will be searching, so use real world samples if possible.

I left short and long vague. For the needle, I think of short as under 8 characters, medium as under 64 characters, and long as under 1k. For the haystack, I think of short as under 2^10, medium as under 2^20, and long as up to 2^30 characters.

drawnonward
  • 53,459
  • 16
  • 107
  • 112
  • 1
    Do you have good suggestions for a test library? The previous question I asked on SO was related to that and I never got any real answers. (except my own...) It should be extensive. Even if my idea of an application for strstr is searching English text, somebody else's might be searching for genes in base pair sequences... – R.. GitHub STOP HELPING ICE Jul 06 '10 at 09:36
  • 3
    It's a bit more complicated than short/long. For the needle, the big questions relevant to the performance of most algorithms are: Length? Is there any periodicity? Does the needle contain all unique characters (no repeats)? Or all the same character? Are there a large number of characters in the haystack that never appear in the needle? Is there a chance of having to deal with needles provided by an attacker who wants to exploit worst-case performance to cripple your system? Etc.. – R.. GitHub STOP HELPING ICE Jul 07 '10 at 05:46
  • This library and test code exist with **smart**, and none of your listed old algos performs well. The Leroy paper is also too old. There are about 80 improvements to BM, KMP, BMH, KP. Currently best (as of 2020) is **EPSM** in all cases. – rurban Dec 22 '20 at 08:25
37

Published in 2011, I believe it may very well be the "Simple Real-Time Constant-Space String Matching" algorithm by Dany Breslauer, Roberto Grossi, and Filippo Mignosi.

Update:

In 2014 the authors published this improvement: Towards optimal packed string matching.

sns
  • 375
  • 1
  • 2
  • 9
user541686
  • 205,094
  • 128
  • 528
  • 886
  • 1
    Wow thanks. I'm reading the paper. If it turns out to be better than what I have, I'll definitely accept your answer. – R.. GitHub STOP HELPING ICE Aug 13 '13 at 16:09
  • 2
    @R..: Sure! :) Speaking of which, if you manage to implement the algorithm, please consider posting it on StackOverflow so everyone can benefit from it! I haven't found any implementations of it anywhere and I'm not good at implementing algorithms I find in research papers haha. – user541686 Aug 13 '13 at 16:39
  • 2
    It's a variant of the "two-way" algorithm I'm already using, so adapting my code to use this might actually be easy. I'll have to read the paper in more detail to be sure, though, and I need to evaluate whether the changes made are compatible with my use of a "bad character table" which greatly speeds up the common case. – R.. GitHub STOP HELPING ICE Aug 13 '13 at 17:40
  • 11
    And you still haven't accepted @Mehrdad's answer! :-) – lifebalance Mar 05 '15 at 03:16
  • 3
    Well, to be fair, it IS a link-only answer, which is not supposed to be allowed here. – Dawood ibn Kareem Dec 30 '16 at 04:15
  • 8
    @DavidWallace: What? It has the paper titles and the authors. Even if the link goes dead you can find the papers. What are you expecting me to do, write pseudocode for the algorithm? What makes you think I understand the algorithm? – user541686 Dec 30 '16 at 04:38
26

I was surprised to see our tech report cited in this discussion; I am one of the authors of the algorithm that was named Sustik-Moore above. (We did not use that term in our paper.)

I want to emphasize here that for me the most interesting feature of the algorithm is that it is quite simple to prove that each letter is examined at most once. For earlier Boyer-Moore versions it was proved that each letter is examined at most 3 times, and later at most 2 times, and those proofs were more involved (see cites in paper). Therefore I also see a didactic value in presenting/studying this variant.

In the paper we also describe further variations that are geared toward efficiency while relaxing the theoretical guarantees. It is a short paper and the material should be understandable to an average high school graduate in my opinion.

Our main goal was to bring this version to the attention of others who can further improve on it. String searching has so many variations and we alone cannot possibly think of all where this idea could bring benefits. (Fixed text and changing pattern, fixed pattern different text, preprocessing possible/not possible, parallel execution, finding matching subsets in large texts, allow errors, near matches etc., etc.)

Matyas
  • 644
  • 6
  • 15
  • 1
    Do you happen to know of a C or C++ implementation available? I'm thinking of using this for some dna motif search (exact motif matches). If not, maybe I'll try developing an implementation myself and submitting to boost algorithm – JDiMatteo Feb 22 '15 at 18:32
  • 4
    With no known available implementation, the Sustik-Moore/2BLOCK algorithm seems unlikely to be used in practice and continue to be omitted from results in summary papers like ["The Exact String Matching Problem: a Comprehensive Experimental Evaluation"](http://arxiv.org/pdf/1012.2547v1.pdf) – JDiMatteo Feb 28 '15 at 17:36
24

The http://www-igm.univ-mlv.fr/~lecroq/string/index.html link you point to is an excellent source and summary of some of the best known and researched string matching algorithms.

Solutions to most search problems involve trade offs with respect to pre-processing overhead, time and space requirements. No single algorithm will be optimal or practical in all cases.

If your objective is to design a specific algorithm for string searching, then ignore the rest of what I have to say. If you want to develop a generalized string-searching service routine, then try the following:

Spend some time reviewing the specific strengths and weaknesses of the algorithms you have already referenced. Conduct the review with the objective of finding a set of algorithms that cover the range and scope of string searches you are interested in. Then, build a front end search selector based on a classifier function to target the best algorithm for the given inputs. This way you may employ the most efficient algorithm to do the job. This is particularly effective when an algorithm is very good for certain searches but degrades badly. For example, brute force is probably the best for needles of length 1 but quickly degrades as needle length increases, whereupon the Sustik-Moore algorithm may become more efficient (over small alphabets), then for longer needles and larger alphabets, the KMP or Boyer-Moore algorithms may be better. These are just examples to illustrate a possible strategy.
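
As a hedged sketch of what such a front end selector might look like (the thresholds are arbitrary and the three searchers are stubbed with strstr here, standing in for real implementations):

#include <string.h>

/* Placeholder searchers, stubbed with strstr; a real routine would plug in actual implementations. */
static char *brute_force_search(const char *h, const char *n)  { return strstr(h, n); }
static char *sustik_moore_search(const char *h, const char *n) { return strstr(h, n); }
static char *boyer_moore_search(const char *h, const char *n)  { return strstr(h, n); }

char *select_and_search(const char *haystack, const char *needle) {
  size_t m = strlen(needle);
  if (m == 0) return (char *)haystack;
  if (m == 1) return strchr(haystack, needle[0]);       /* strchr/brute force wins for length 1 */

  /* Cheap classifier: count distinct byte values in the needle as an alphabet-size estimate. */
  unsigned char seen[256] = {0};
  size_t distinct = 0;
  for (size_t i = 0; i < m; i++) {
    unsigned char c = (unsigned char)needle[i];
    if (!seen[c]) { seen[c] = 1; distinct++; }
  }

  if (m <= 8)        return brute_force_search(haystack, needle);   /* preprocessing not worth it */
  if (distinct <= 4) return sustik_moore_search(haystack, needle);  /* small alphabet, e.g. DNA */
  return boyer_moore_search(haystack, needle);                      /* long needle, larger alphabet */
}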

The multiple-algorithm approach is not a new idea. I believe it has been employed by a few commercial Sort/Search packages (e.g. SYNCSORT, commonly used on mainframes, implements several sort algorithms and uses heuristics to choose the "best" one for the given inputs).

Each search algorithm comes in several variations that can make significant differences to its performance, as, for example, this paper illustrates.

Benchmark your service to categorize the areas where additional search strategies are needed or to more effectively tune your selector function. This approach is not quick or easy but if done well can produce very good results.

NealB
  • 16,670
  • 2
  • 39
  • 60
  • 1
    Thanks for the response, especially the link to Sustik-Moore which I hadn't seen before. The multiple algorithms approach is surely in widespread use. Glibc basically does strchr, Two-Way without bad character shift table, or Two-Way with bad character shift table, depending on whether needle_len is 1, <32, or >32. My current approach is the same except that I always use the shift table; I replaced the 1kb memset necessary to do so with a 32 byte memset on a bitset used to mark which elements of the table have been initialized, and I get the benefit (but not overhead) even for tiny needles. – R.. GitHub STOP HELPING ICE Jul 07 '10 at 05:54
  • 1
    After thinking about it, I'm really curious what the intended application for Sustik-Moore is. With small alphabets, you'll never get to make any significant shifts (all characters of the alphabet almost surely appear near the end of the needle) and finite automata approaches are very efficient (small state transition table). So I can't envision any scenario where Sustik-Moore could be optimal... – R.. GitHub STOP HELPING ICE Jul 07 '10 at 13:20
  • great response -- if I could star this particular answer I would. – Jason S Jul 07 '10 at 14:03
  • 2
    @R.. The theory behind the sustik-moore algorithm is that it should give you larger average shift amounts when the needle is relatively large and the alphabet is relatively small (eg. searching for DNA sequences). Larger in this case just means larger than the basic Boyer-Moore algorithm would produce given the same inputs. How much more efficient this is relative to a finite automata approach or to some other Boyer-Moore variation (of which there are many) is hard to say. That is why I emphasised spending some time to research the specific strengths/weaknesses of your candidate algorithms. – NealB Jul 07 '10 at 15:13
  • 1
    Hm, I guess I was stuck thinking of shifts just in the sense of bad character shifts from Boyer-Moore. With an improvement on BM good suffix shifts though, Sustik-Moore could possibly outperform DFA approaches to DNA searching. Neat stuff. – R.. GitHub STOP HELPING ICE Jul 07 '10 at 15:44
19

The fastest substring search algorithm is going to depend on the context:

  1. the alphabet size (e.g. DNA vs English)
  2. the needle length

The 2010 paper "The Exact String Matching Problem: a Comprehensive Experimental Evaluation" gives tables with runtimes for 51 algorithms (with different alphabet sizes and needle lengths), so you can pick the best algorithm for your context.

All of those algorithms have C implementations, as well as a test suite, here:

http://www.dmi.unict.it/~faro/smart/algorithms.php

JDiMatteo
  • 12,022
  • 5
  • 54
  • 65
  • Nope, the 2010 paper misses the latest and best, which is **epsm**, "Exact Packed String Matching". epsm is best in all cases. faro's new website is currently broken, in the internet archive you can see it. – rurban Dec 22 '20 at 08:21
  • @rurban: The 2010 paper misses the paper from 2013? – user541686 Feb 03 '21 at 03:04
  • @user541686: exactly. the latest and best was not yet known. op referred to that old paper, which misses the latest and best. – rurban Feb 11 '21 at 13:05
4

A really good question. Let me just add some tiny bits...

  1. Someone was talking about DNA sequence matching. But for DNA sequences, what we usually do is build a data structure (e.g. suffix array, suffix tree or FM-index) for the haystack and match many needles against it. This is a different question.

  2. It would be really great if someone would like to benchmark various algorithms. There are very good benchmarks on compression and the construction of suffix arrays, but I have not seen a benchmark on string matching. Potential haystack candidates could be from the SACA benchmark.

  3. A few days ago I was testing the Boyer-Moore implementation from the page you recommended (EDIT: I need a function call like memmem(), but it is not a standard function, so I decided to implement it). My benchmarking program uses random haystack. It seems that the Boyer-Moore implementation in that page is times faster than glibc's memmem() and Mac's strnstr(). In case you are interested, the implementation is here and the benchmarking code is here. This is definitely not a realistic benchmark, but it is a start.

user172818
  • 4,518
  • 1
  • 18
  • 20
  • If you have some good needles to test along with the haystack candidates from the SACA benchmark, post them as an answer to my [other question](http://stackoverflow.com/questions/3134602/what-are-good-test-cases-for-benchmarking-stress-testing-substring-search-algor) and, short of getting a better answer, I'll mark it accepted. – R.. GitHub STOP HELPING ICE Jul 08 '10 at 04:23
  • 3
    About your memmem and Boyer-Moore, it's very likely that Boyer-Moore (or rather one of the enhancements to Boyer-Moore) will perform best on random data. Random data has an extremely low probability of periodicity and long partial matches which lead to quadratic worst-case. I'm looking for a way to combine Boyer-Moore and Two-Way or to efficiently detect when Boyer-Moore is "safe to use" but so far I haven't had any success. BTW I wouldn't use glibc's memmem as a comparison. My implementation of what's basically the same algorithm as glibc's is several times faster. – R.. GitHub STOP HELPING ICE Jul 08 '10 at 05:34
  • As I said, it is not my implementation. Credit to Christian Charras and Thierry Lecroq. I can imagine why random input is bad for benchmarking and I am sure glibc chooses algorithms for reasons. I also guess memmem() is not implemented efficiently. I will try. Thanks. – user172818 Jul 08 '10 at 14:08
4

A faster "Search for a single matching character" (ala strchr) algorithm.

Important notes:

  • These functions use a "number / count of (leading|trailing) zeros" gcc compiler intrinsic, __builtin_ctz. These functions are likely to be fast only on machines that have an instruction (or instructions) that performs this operation (i.e., x86, ppc, arm).

  • These functions assume the target architecture can perform 32 and 64 bit unaligned loads. If your target architecture does not support this, you will need to add some start up logic to properly align the reads.

  • These functions are processor neutral. If the target CPU has vector instructions, you might be able to do (much) better. For example, the strlen function below uses SSE3 and can be trivially modified to XOR the bytes scanned to look for a byte other than 0. Benchmarks performed on a 2.66GHz Core 2 laptop running Mac OS X 10.6 (x86_64):

    • 843.433 MB/s for strchr
    • 2656.742 MB/s for findFirstByte64
    • 13094.479 MB/s for strlen

... a 32-bit version:

/* SWAR zero-byte test: each zero byte of x sets the high bit of the corresponding
   byte of _x (all other bits end up clear); clz/ctz then converts that into a
   1-based byte index. */
#ifdef __BIG_ENDIAN__
#define findFirstZeroByte32(x) ({ uint32_t _x = (x); _x = ~(((_x & 0x7F7F7F7Fu) + 0x7F7F7F7Fu) | _x | 0x7F7F7F7Fu); (_x == 0u)   ? 0 : (__builtin_clz(_x) >> 3) + 1; })
#else
#define findFirstZeroByte32(x) ({ uint32_t _x = (x); _x = ~(((_x & 0x7F7F7F7Fu) + 0x7F7F7F7Fu) | _x | 0x7F7F7F7Fu);                    (__builtin_ctz(_x) + 1) >> 3; })
#endif

unsigned char *findFirstByte32(unsigned char *ptr, unsigned char byte) {
  uint32_t *ptr32 = (uint32_t *)ptr, firstByte32 = 0u, byteMask32 = (byte) | (byte << 8);
  byteMask32 |= byteMask32 << 16;  /* replicate the target byte into all four lanes */
  /* XOR turns bytes equal to the target into zero bytes, which the SWAR test then locates. */
  while((firstByte32 = findFirstZeroByte32((*ptr32) ^ byteMask32)) == 0) { ptr32++; }
  return(ptr + ((((unsigned char *)ptr32) - ptr) + firstByte32 - 1));
}

... and a 64-bit version:

/* 64-bit analogue of the 32-bit version above: the same SWAR zero-byte test widened to eight lanes. */
#ifdef __BIG_ENDIAN__
#define findFirstZeroByte64(x) ({ uint64_t _x = (x); _x = ~(((_x & 0x7F7F7F7F7f7f7f7full) + 0x7F7F7F7F7f7f7f7full) | _x | 0x7F7F7F7F7f7f7f7full); (_x == 0ull) ? 0 : (__builtin_clzll(_x) >> 3) + 1; })
#else
#define findFirstZeroByte64(x) ({ uint64_t _x = (x); _x = ~(((_x & 0x7F7F7F7F7f7f7f7full) + 0x7F7F7F7F7f7f7f7full) | _x | 0x7F7F7F7F7f7f7f7full);                    (__builtin_ctzll(_x) + 1) >> 3; })
#endif

unsigned char *findFirstByte64(unsigned char *ptr, unsigned char byte) {
  uint64_t *ptr64 = (uint64_t *)ptr, firstByte64 = 0u, byteMask64 = (byte) | (byte << 8);
  byteMask64 |= byteMask64 << 16;
  byteMask64 |= byteMask64 << 32;  /* replicate the target byte into all eight lanes */
  while((firstByte64 = findFirstZeroByte64((*ptr64) ^ byteMask64)) == 0) { ptr64++; }
  return(ptr + ((((unsigned char *)ptr64) - ptr) + firstByte64 - 1));
}

Edit 2011/06/04 The OP points out in the comments that this solution has an "insurmountable bug":

it can read past the sought byte or null terminator, which could access an unmapped page or page without read permission. You simply cannot use large reads in string functions unless they're aligned.

This is technically true, but applies to virtually any algorithm that operates on chunks that are larger than a single byte, including the method suggested by the OP in the comments:

A typical strchr implementation is not naive, but quite a bit more efficient than what you gave. See the end of this for the most widely used algorithm: http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord

It also really has nothing to do with alignment per se. True, this could potentially cause the behavior discussed on the majority of common architectures in use, but this has more to do with microarchitecture implementation details: if the unaligned read straddles a 4K boundary (again, typical), then that read will cause a program terminating fault if the next 4K page boundary is unmapped.

But this isn't a "bug" in the algorithm given in the answer- that behavior is because functions like strchr and strlen do not accept a length argument to bound the size of the search. Searching char bytes[1] = {0x55};, which for the purposes of our discussion just so happens to be placed at the very end of a 4K VM page boundary and the next page is unmapped, with strchr(bytes, 0xAA) (where strchr is a byte-at-a-time implementation) will crash exactly the same way. Ditto for strchr related cousin strlen.

Without a length argument, there is no way to tell when you should switch out of the high speed algorithm and back to a byte-by-byte algorithm. A much more likely "bug" would be to read "past the size of the allocation", which technically results in undefined behavior according to the various C language standards, and would be flagged as an error by something like valgrind.

In summary, anything that operates on larger-than-byte chunks to go faster, as this answer's code and the code pointed out by the OP do, but must have byte-accurate read semantics, is likely to be "buggy" if there is no length argument to control the corner case(s) of "the last read".

The code in this answer is a kernel for being able to find the first byte in a natural CPU word size chunk quickly if the target CPU has a fast ctz like instruction. It is trivial to add things like making sure it only operates on correctly aligned natural boundaries, or some form of length bound, which would allow you to switch out of the high speed kernel and into a slower byte-by-byte check.
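
For illustration, here is one minimal way to do the aligned variant (a sketch only, not the code benchmarked above): handle bytes one at a time until the pointer is word aligned, then use only aligned 32-bit loads, which never straddle a page boundary, together with the zero-in-word test from the bithacks link. A production version would also want to address strict aliasing (e.g. by doing the word loads via memcpy or a may_alias type):

#include <stddef.h>
#include <stdint.h>

static const char *find_byte_aligned(const char *p, unsigned char byte) {
  /* Prologue: byte-at-a-time until p is 4-byte aligned (also catches early hits and the terminator). */
  while ((uintptr_t)p % sizeof(uint32_t)) {
    if ((unsigned char)*p == byte) return p;
    if (*p == '\0') return NULL;
    p++;
  }

  const uint32_t ones = 0x01010101u, highs = 0x80808080u;
  uint32_t mask = byte * ones;                           /* target byte replicated into each lane */
  for (;;) {
    /* Aligned load: may read a few bytes past the terminator, but never past the page. */
    uint32_t w = *(const uint32_t *)(const void *)p;
    uint32_t zero_hit  = (w - ones) & ~w & highs;                     /* word contains a 0 byte */
    uint32_t match_hit = ((w ^ mask) - ones) & ~(w ^ mask) & highs;   /* word contains the target */
    if (zero_hit | match_hit) {
      for (;; p++) {                                     /* resolve the exact position byte-by-byte */
        if ((unsigned char)*p == byte) return p;
        if (*p == '\0') return NULL;
      }
    }
    p += sizeof(uint32_t);
  }
}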

The OP also states in the comments:

As for your ctz optimization, it only makes a difference for the O(1) tail operation. It could improve performance with tiny strings (e.g. strchr("abc", 'a'); but certainly not with strings of any major size.

Whether or not this statement is true depends a great deal on the microarchitecture in question. Using the canonical 4 stage RISC pipeline model, it is almost certainly true. But it is extremely hard to tell if it is true for a contemporary out-of-order superscalar CPU, where the core speed can utterly dwarf the memory streaming speed. In this case, it is not only plausible, but quite common, for there to be a large gap between "the number of instructions that can be retired" and "the number of bytes that can be streamed", so that you have many instructions that can be retired for each byte that can be streamed. If this is large enough, the ctz + shift instruction can be done "for free".

johne
  • 6,760
  • 2
  • 24
  • 25
  • "For needles of length 1, use `strchr`."- You asked for the fastest substring search algorithm(s). Finding a substring of length 1 is a just a special case, one that can also be optimized. If you swap out your current special case code for substrings of length 1 (`strchr`) with something like the above, things will (possibly, depending on how `strchr` is implemented) go faster. The above algorithm is nearly 3x faster than a typical naive `strchr` implementation. – johne Jun 04 '11 at 00:11
  • A typical `strchr` implementation is not naive, but quite a bit more efficient than what you gave. See the end of this for the most widely used algorithm: http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord – R.. GitHub STOP HELPING ICE Jun 04 '11 at 12:14
  • That isn't necessarily faster- the link provided only determines if an `int` contains a `0` byte, not its location (as the code in the answer does). As to whether or not one is faster than the other depends on the microarchitecture. As to typical `strchr` implementations.. well, the above code clearly beats a typical `strchr` implementation. – johne Jun 04 '11 at 18:56
  • By the way your version with possibly misaligned reads has an insurmountable bug: it can read past the sought byte or null terminator, which could access an unmapped page or page without read permission. You simply cannot use large reads in string functions unless they're aligned. – R.. GitHub STOP HELPING ICE Jun 04 '11 at 19:42
  • As for your ctz optimization, it only makes a difference for the O(1) tail operation. It could improve performance with *tiny* strings (e.g. `strchr("abc", 'a');` but certainly not with strings of any major size. – R.. GitHub STOP HELPING ICE Jun 04 '11 at 19:43
  • 2
    OP said string was properly null terminated, so your discussion about `char bytes[1] = {0x55};` is irrelevant. Very relevant is your comment about this being true for any word read algorithm which does not know the length beforehand. – Seth Robertson Jun 05 '11 at 02:17
  • 1
    The problem does not apply to the version I cited because you only use it on aligned pointers - at least that's what correct implementations do. – R.. GitHub STOP HELPING ICE Jun 05 '11 at 03:12
  • @seth-robertson, you're right. And if you were to use the code in this answer in a standards compliant `strchr`, you would be required to check for `'\0'` as well as the character to check. – johne Jun 05 '11 at 19:18
  • 2
    @R, it has nothing to do with "aligned pointers". Hypothetically, if you had an architecture which supported VM protection with byte level granularity, and each `malloc` allocation was "sufficiently padded" on either side *and* the VM system enforced byte granular protection for that allocation.... whether or not the pointer is aligned (assuming trivial 32-bit `int` natural alignment) is moot- it is still possible for that aligned read to read past the size of the allocation. ***ANY*** read past the size of the allocation is `undefined behavior`. – johne Jun 05 '11 at 19:28
  • 5
    @johne: +1 to comment. Conceptually you're right, but the reality is that byte-granularity protections are so expensive both to store and to enforce that they don't and never will exist. If you know the underlying storage is page-granularity mappings obtained from the equivalent of `mmap`, then alignment is sufficient. – R.. GitHub STOP HELPING ICE Jun 13 '11 at 02:16
  • Do you mind explaining your findFirstZeroByte32() macro? I'm especially having problems understanding the 0x7F7F7F7Fu part. – asdf Jul 01 '11 at 15:47
  • I don't see any SSE3 insns in Apple's [i386 strlen](http://opensource.apple.com/source/Libc/Libc-594.9.4/i386/string/strlen.s), only SSE2. `pcmpeqb`, `pmovmskb`, `pxor`, and `movdqa` are all SSE2. Probably they just say SSE3 in comments because SSE3 was Apple's baseline for x86 CPUs. (The original apple x86s were Core Duo, and SSSE3 was new in Core **2** Duo (Merom/Conroe). Kinda too bad Apple had to support a CPU that couldn't run 64bit code, since they were one CPU generation too early.) – Peter Cordes Oct 09 '15 at 15:11
4

I know it's an old question, but most bad shift tables are single character. If it makes sense for your dataset (e.g. especially if it's written words), and if you have the space available, you can get a dramatic speedup by using a bad shift table made of n-grams rather than single characters.
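
For example, here is a hedged sketch of a Horspool-style search whose bad shift table is indexed by the last two bytes of the window instead of one (the 64K-entry table is the space trade-off mentioned above; the structure and names are illustrative only):

#include <stddef.h>
#include <string.h>

char *horspool_bigram(const char *haystack, const char *needle) {
  size_t m = strlen(needle), n = strlen(haystack);
  if (m < 2) return m ? strchr(haystack, needle[0]) : (char *)haystack;
  if (m > n) return NULL;

  /* 64K-entry shift table indexed by the last two bytes of the window.
     Static (hence not thread-safe) just to keep the sketch simple; filling it
     is the preprocessing cost traded for the larger shifts. */
  static size_t shift[1u << 16];
  for (size_t i = 0; i < (1u << 16); i++) shift[i] = m - 1;    /* bigram absent from needle */
  for (size_t i = 0; i + 2 < m; i++) {                         /* bigrams that allow a shift >= 1 */
    unsigned idx = (unsigned char)needle[i] << 8 | (unsigned char)needle[i + 1];
    shift[idx] = m - 2 - i;
  }

  for (size_t pos = 0; pos + m <= n; ) {
    if (memcmp(haystack + pos, needle, m) == 0)
      return (char *)(haystack + pos);
    unsigned idx = (unsigned char)haystack[pos + m - 2] << 8
                 | (unsigned char)haystack[pos + m - 1];
    pos += shift[idx];                                         /* shift on the window's last bigram */
  }
  return NULL;
}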

Timothy Jones
  • 21,495
  • 6
  • 60
  • 90
3

The Two-Way Algorithm that you mention in your question (which by the way is incredible!) has recently been improved to work efficiently on multibyte words at a time: Optimal Packed String Matching.

I haven't read the whole paper, but it seems they rely on a couple of new, special CPU instructions (included in e.g. SSE 4.2) being O(1) for their time complexity claim, though if they aren't available they can simulate them in O(log log w) time for w-bit words, which doesn't sound too bad.

j_random_hacker
  • 50,331
  • 10
  • 105
  • 169
3

You could implement, say, 4 different algorithms. Every M minutes (to be determined empirically) run all 4 on current real data. Accumulate statistics over N runs (also TBD). Then use only the winner for the next M minutes.

Log stats on Wins so that you can replace algorithms that never win with new ones. Concentrate optimization efforts on the winningest routine. Pay special attention to the stats after any changes to the hardware, database, or data source. Include that info in the stats log if possible, so you won't have to figure it out from the log date/time-stamp.
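
A minimal sketch of the re-profiling machinery (illustrative only; the candidate searchers are stubbed with strstr, and a real version would sample live traffic rather than a fixed probe):

#include <stddef.h>
#include <string.h>
#include <time.h>

typedef char *(*search_fn)(const char *haystack, const char *needle);

/* Candidate searchers; all stubbed with strstr here. */
static char *search_a(const char *h, const char *n) { return strstr(h, n); }
static char *search_b(const char *h, const char *n) { return strstr(h, n); }

static search_fn candidates[] = { search_a, search_b };
static size_t current = 0;                       /* index of the current winner */

/* Re-run every candidate on a sample of recent real data and keep the fastest. */
void reprofile(const char *sample_haystack, const char *sample_needle) {
  double best = -1.0;
  for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
    char * volatile sink = NULL;                 /* keeps the calls from being optimized away */
    clock_t t0 = clock();
    for (int rep = 0; rep < 1000; rep++)
      sink = candidates[i](sample_haystack, sample_needle);
    (void)sink;
    double elapsed = (double)(clock() - t0) / CLOCKS_PER_SEC;
    if (best < 0.0 || elapsed < best) { best = elapsed; current = i; }
  }
}

/* The production entry point just dispatches to whichever candidate last won. */
char *adaptive_search(const char *haystack, const char *needle) {
  return candidates[current](haystack, needle);
}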

Guy Gordon
  • 740
  • 6
  • 13
3

I recently discovered a nice tool to measure the performance of the various available algos: http://www.dmi.unict.it/~faro/smart/index.php

You might find it useful. Also, if I had to take a quick call on a substring search algorithm, I would go with Knuth-Morris-Pratt.
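
For reference, a minimal Knuth-Morris-Pratt sketch (illustrative only); note that as written it allocates an O(m) failure table, so it would need a fallback to satisfy the no-allocation-failure requirement in the question:

#include <stdlib.h>
#include <string.h>

char *kmp_search(const char *haystack, const char *needle) {
  size_t m = strlen(needle);
  if (m == 0) return (char *)haystack;

  size_t *fail = malloc(m * sizeof *fail);   /* fail[i] = length of the longest proper border of needle[0..i] */
  if (!fail) return NULL;                    /* this is where the question's fallback would have to kick in */

  fail[0] = 0;
  for (size_t i = 1, k = 0; i < m; i++) {
    while (k > 0 && needle[i] != needle[k]) k = fail[k - 1];
    if (needle[i] == needle[k]) k++;
    fail[i] = k;
  }

  size_t k = 0;
  for (const char *p = haystack; *p; p++) {
    while (k > 0 && *p != needle[k]) k = fail[k - 1];
    if (*p == needle[k]) k++;
    if (k == m) { free(fail); return (char *)(p - m + 1); }
  }
  free(fail);
  return NULL;
}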

Sandeep Giri
  • 821
  • 1
  • 7
  • 14
3

Here's Python's search implementation, used throughout the core. The comments indicate it uses a compressed Boyer-Moore delta-1 table.

I have done some pretty extensive experimentation with string searching myself, but it was for multiple search strings. Assembly implementations of Horspool and Bitap can often hold their own against algorithms like Aho-Corasick for low pattern counts.
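
For reference, here is a minimal Bitap (shift-and) sketch for needles up to 64 bytes (illustrative only, not the assembly versions mentioned above); the inner loop is one table lookup, one shift, one OR and one AND per haystack byte:

#include <stdint.h>
#include <string.h>

char *bitap_search(const char *haystack, const char *needle) {
  size_t m = strlen(needle);
  if (m == 0) return (char *)haystack;
  if (m > 64) return NULL;                     /* pattern state no longer fits in one 64-bit word */

  uint64_t masks[256] = {0};                   /* masks[c] bit i set  <=>  needle[i] == c */
  for (size_t i = 0; i < m; i++)
    masks[(unsigned char)needle[i]] |= (uint64_t)1 << i;

  uint64_t state = 0, done = (uint64_t)1 << (m - 1);
  for (const char *p = haystack; *p; p++) {
    /* bit i of state set  <=>  needle[0..i] matches the text ending at p */
    state = ((state << 1) | 1) & masks[(unsigned char)*p];
    if (state & done)
      return (char *)(p - m + 1);
  }
  return NULL;
}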

Matt Joiner
  • 112,946
  • 110
  • 377
  • 526
3

The fastest is currently EPSM, by S. Faro and O. M. Kulekci. See https://smart-tool.github.io/smart/

"Exact Packed String Matching" optimized for SIMD SSE4.2 (x86_64 and aarch64). It performs stable and best on all sizes.

The site I linked to compares 199 fast string search algorithms, with the usual ones (BM, KMP, BMH) being pretty slow. EPSM outperforms all the others mentioned here on these platforms. It's also the latest.

Update 2020: EPSM was recently optimized for AVX and is still the fastest.

rurban
  • 4,025
  • 24
  • 27
3

Just search for "fastest strstr", and if you see something of interest just ask me.

In my view you impose too many restrictions on yourself (yes, we all want a sublinear, at most linear, searcher), however it takes a real programmer to step in; until then I think that the hash approach is simply a nifty-limbo solution (well reinforced by BNDM for shorter 2..16 patterns).

Just a quick example:

Doing Search for Pattern(32bytes) into String(206908949bytes) as-one-line ... Skip-Performance(bigger-the-better): 3041%, 6801754 skips/iterations Railgun_Quadruplet_7Hasherezade_hits/Railgun_Quadruplet_7Hasherezade_clocks: 0/58 Railgun_Quadruplet_7Hasherezade performance: 3483KB/clock

Doing Search for Pattern(32bytes) into String(206908949bytes) as-one-line ... Skip-Performance(bigger-the-better): 1554%, 13307181 skips/iterations Boyer_Moore_Flensburg_hits/Boyer_Moore_Flensburg_clocks: 0/83 Boyer_Moore_Flensburg performance: 2434KB/clock

Doing Search for Pattern(32bytes) into String(206908949bytes) as-one-line ... Skip-Performance(bigger-the-better): 129%, 160239051 skips/iterations Two-Way_hits/Two-Way_clocks: 0/816 Two-Way performance: 247KB/clock

Sanmayce,
Regards

Georgi
  • 148
  • 7
2

You might also want to have diverse benchmarks with several types of strings, as this may have a great impact on performance. The algos will perform differently based on whether they are searching natural language (and even here there still might be fine grained distinctions because of the different morphologies), DNA strings, or random strings, etc.

Alphabet size will play a role in many algos, as will needle size. For instance Horspool does well on English text but badly on DNA because of the different alphabet size, which makes life hard for the bad-character rule. Introducing the good-suffix rule alleviates this greatly.

1

Use stdlib strstr:

char *foundit = strstr(haystack, needle);

It was very fast, only took me about 5 seconds to type.

Conrad Meyer
  • 2,851
  • 21
  • 24
0

I don't know if it's the absolute best, but I've had good experience with Boyer-Moore.

R Samuel Klatchko
  • 74,869
  • 16
  • 134
  • 187
  • Do you know a way to combine Boyer-Moore's bad shift table with Two-Way? Glibc does a variant of this for long needles (>32 byte) but only checks the last byte. The problem is that Two-Way needs to search the right portion of the needle left-to-right, whereas Boyer-Moore's bad shift is most efficient when searching from right-to-left. I tried using it with left-to-right in Two-Way (advance by shift table or normal Two-Way right half mismatch, whichever is longer) but I got a 5-10% slowdown versus normal Two-Way in most cases and couldn't find any cases where it improved performance. – R.. GitHub STOP HELPING ICE Jul 06 '10 at 05:28