
Suppose I have a very long string, such as a filepath, and I want to search for something in it, much as the `find` command does. It seems like a basic implementation of this would be along the lines of:

#include <string.h>

int contains(const char *sent, const char *word) {
    if (strstr(sent, word) != NULL) {
        return 1;
    }
    return 0;
}

Would there be any performance difference between doing that and something like Boyer-Moore? Or does `strstr` already do something just as efficient?

Basically, I have about a billion very long strings, and I'm looking to do a fast(ish) find on them (without any indexing), using the most efficient substring-search implementation. What should I use?


Update: To give a more concrete example, let's say I have a billion filepaths I want to search through:

/archive/1002/myfile.txt
/archive/1002/newer.mov
/user/tom/local_2014version1.mov

And from this I would search for one or more strings. Example searches would be:

"1002"            // would return the first two files
"mov version tom" // would return the third row
  • C and C++ are distinct languages. It's not relevant to tag C++ for a question about C. – François Andrieux Aug 27 '19 at 20:17
  • There's no particular general implementation for `strstr`. – Thomas Jager Aug 27 '19 at 20:17
  • I've found a post that might be of use to you: https://stackoverflow.com/questions/3183582/what-is-the-fastest-substring-search-algorithm – omri Aug 27 '19 at 20:19
  • Possible duplicate of [What is the fastest substring search algorithm?](https://stackoverflow.com/questions/3183582/what-is-the-fastest-substring-search-algorithm) – ggorlen Aug 27 '19 at 20:19
  • A billion very long strings? Like, 100 GB of data? You definitely need something better than linear search... – hyde Aug 27 '19 at 20:20
  • @ggorlen sure, I've taken a look at that -- that's where I got the above link. My question is more about (re)implementing Boyer-Moore vs. using `strstr` -- do I need to implement Boyer-Moore myself, or does `strstr` already do that? –  Aug 27 '19 at 20:20
  • @hyde -- a bit less, maybe about 10GB or so. –  Aug 27 '19 at 20:26
  • Any linear search will be more "ish" than "fast". Seriously, create an index. See https://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english/9513423#9513423 to understand how you would do so efficiently. – btilly Aug 27 '19 at 20:29
  • @Shared [This comment](https://stackoverflow.com/a/4208218/6243352) from the other thread states that the asker had no problem outperforming `strstr`. – ggorlen Aug 27 '19 at 20:34
  • It kind of depends on how often you're going to do this. If you need to do it once, then `strstr` is just fine. If you start it now, you'll have results before you can code up something faster. If you need to do this often, then I'd suggest implementing something like the [Aho-Corasick string search algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm). – Jim Mischel Aug 27 '19 at 20:37
  • @JimMischel thanks for the suggestion. Why would you say (in the above case) using Aho-Corasick would be better than something like BM? –  Aug 27 '19 at 20:52
  • For some reason I thought you were wanting to search for multiple strings. If you only want to search for a single string, then the Boyer-Moore algorithm is probably the way to go. Aho-Corasick excels when there are multiple search strings. – Jim Mischel Aug 27 '19 at 21:55
  • @JimMischel -- well, actually the answer is a bit more complex. It depends...The user could search for "myfile" or they could also search for "mov mydir myfile archive", in which case it would be multiple strings. In that case, does the implementation change a bit? –  Aug 28 '19 at 05:21
  • *"a filepath"* is not a long string. On Linux that is generally less than 4096 characters (e.g. `PATH_MAX` characters); on Windows it is typically 260 characters (`MAX_PATH`). Calling `strstr` a billion times on such strings will be just about as efficient as anything else (and a whole lot less error-prone than something you write yourself) – David C. Rankin Aug 28 '19 at 06:23
  • The Aho-Corasick algorithm excels when you're searching for "foo" or "bar" or "scooby-doo". You can find all occurrences of all strings in a single pass. – Jim Mischel Aug 28 '19 at 14:27

1 Answer


Advanced search algorithms like Boyer-Moore and Aho-Corasick work by precomputing lookup tables from the string(s) to be searched for, which incurs a large start-up time. It's very unlikely that searching something as small as a pathname would be able to make up for that high overhead. You really have to be searching something like multi-page documents before those algorithms show their value.

Lee Daniel Crocker
  • I see, thanks for this suggestion. So in the case of the above, where we don't want to use indexes (at least to start with), what type of algorithmic search would you suggest? Is `strstr` as good as it would get on small strings? –  Aug 27 '19 at 21:11
  • Yes, I would just use `strstr()` -- even thinking of anything else is seriously premature optimization. You can consider changing it when you have profiling data that shows it as a bottleneck, which I highly doubt it will ever be. – Lee Daniel Crocker Aug 27 '19 at 21:17
  • No. Boyer-Moore precomputes its lookup table from the string that you are searching FOR, not the one you are searching IN. This is how searching for a string of length `m` in a string of length `n` can take time `O(n/m)`. Therefore you could indeed use that. It just won't help you enough with gigabytes of data to search through. – btilly Aug 27 '19 at 21:24
  • I did word that badly, and I'll fix. But the answer stands. The size of the BM lookup table, for example, is the size of the character set, which is probably 256 bytes for 8-bit characters and more for Unicode. Building a table that size from even a small string to be searched for is overhead that would overwhelm the tiny size of a pathname to be searched. – Lee Daniel Crocker Aug 27 '19 at 21:42
  • Except that he's searching multiple strings, so the cost of initializing the lookup table is amortized. Across a billion strings, the per-search cost of building that lookup table is essentially zero. – Jim Mischel Aug 27 '19 at 21:58
  • Actually, the question makes it unclear whether the "billion strings" are needles or haystacks. If needles, then yes, Aho-Corasick might be useful. – Lee Daniel Crocker Aug 27 '19 at 21:59
  • @LeeDanielCrocker to clarify -- imagine a billion files, one search string (find myfile.mov in this database/filesystem) –  Aug 27 '19 at 22:33
  • Ah, well then that is a good candidate for Boyer-Moore. That's not at all how I read the question the first time. – Lee Daniel Crocker Aug 27 '19 at 23:36
  • @LeeDanielCrocker thanks for pointing out the confusion in how I worded the question. I've updated it with a couple concrete examples. –  Aug 28 '19 at 05:24