2

I've got an array of string prefixes: std::vector<std::string> haystack = {"/bin/", "/usr/bin/", "/usr/local/bin/"}.

Is there an efficient way to find that a std::string needle = "/bin/echo" starts with a sub-string from haystack, using standard C++ library?

If I would need to find the exact match, I could use std::set<std::string>, which would perform an efficient binary search, however I need to match only the first part of the string, so currently I'm doing it using a simple loop:

for (auto it = haystack.begin(); it != haystack.end(); it++) {
    if (needle.compare(0, it->size(), *it) == 0) {
        return true; // Found it
    }
}
return false;
pelya
  • 4,326
  • 2
  • 24
  • 23
  • 3
    Define _efficient_ please. To shorten the code there's `std::find_if()`. – πάντα ῥεῖ Sep 06 '16 at 17:39
  • Faster than iterating through the whole `haystack` array, which would be `O(n)`. `find_if` will perform the exact same loop with `O(n)` speed. – pelya Sep 06 '16 at 17:40
  • 1
    Divide and conquer. But even that can't guarantee being faster than O(n). – πάντα ῥεῖ Sep 06 '16 at 17:41
  • Be more precise about search string and searched text. Do you want to test any prefix for each string. Do all prefix are directory ending with /. Do you want to find file at any level under prefix directories or only immediate child. That kind of details would help finding the most efficient way to do it. It would also depends on the data size... – Phil1970 Sep 06 '16 at 17:47
  • Are the prefixes sorted? What do you mean by "prefix": full match or partial match? Would "/bin/foo/echo" match "/bin/"? Can prefix contain something like "/usr/lib" and match "/usr/lib64/baz"? – kfsone Sep 06 '16 at 17:48
  • @πάνταῥεῖ: I don't think this is a duplicate of that question since those answers only cover half the question (if at all). – Violet Giraffe Sep 06 '16 at 17:50
  • @VioletGiraffe Feel free to vote for reopening. – πάντα ῥεῖ Sep 06 '16 at 17:51
  • Sounds like you are implementing some sort of user-initiated auto-completion. In that case, do you have a reason to make anything *"more efficient"*? Did profiling indicate a bottleneck? – IInspectable Sep 06 '16 at 17:51
  • Yes, in my example "/bin/foo/echo" would match "/bin/". I don't need to go through the rest of the array, if one match is found. – pelya Sep 06 '16 at 17:51
  • @pelya There's `break;`?? – πάντα ῥεῖ Sep 06 '16 at 17:52
  • Yes, I've updated my code example to reflect that. – pelya Sep 06 '16 at 17:56
  • One really efficient algorithm would be a trie structure, used for autocompletion, modified to check for a complete string match instead of a partial results, the efficiency would be `O(needle.size())` instead of `O(haystack.size() * needle.size())`. Unfortunately the trie is not a part of C++ standard library. – pelya Sep 06 '16 at 18:13

2 Answers2

0
  1. Sort hasystack by descending string length.
  2. Compare prefixes no longer than needle in that order; do it right to left. E. g. if prefix is 5 characters long, compare prefix[4] to needle[4], then prefix[3] to needle[3] and so on.

This way you will instantly discard many non-matches. As a bonus, you will find the longest match first (might just be what you want).

Violet Giraffe
  • 32,368
  • 48
  • 194
  • 335
  • Why sort the haystack? That turns this into an `O(n log n)` operation in the haystack's size. It's `O(n)` otherwise. – Richard Sep 06 '16 at 18:00
  • @Richard: this isn't strictly necessary. Assuming it is not required to find the longest matches first, sorting may or may not be beneficial depending on the specifics. – Violet Giraffe Sep 06 '16 at 18:05
0

The one "optimization" I'd add is that if you use std::any_of it will short-circuit upon finding the first substring match

auto found = std::any_of(begin(haystack),
                         end(haystack),
                         [&needle](std::string const& sub)
                         { 
                             return needle.compare(0, sub.size(), sub) == 0;
                         });

Otherwise if you want to find which substring matched, you could use std::find_if, which will also short-circuit upon finding the first match.

auto match = std::find_if(begin(haystack),
                          end(haystack),
                          [&needle](std::string const& sub)
                          { 
                              return needle.compare(0, sub.size(), sub) == 0;
                          });
Cory Kramer
  • 114,268
  • 16
  • 167
  • 218