-2

Suppose I have a fixed list of strings.

For any input string, I want to find out which string in the list is a substring of the input string, or there is no such one in the list.

My way of doing it is: for each string in the list, take it as a pattern and use regex matching to see if it matches the input string somewhere.

Is using regex an overkill approach?

Thanks.

It is in C++. The plan is to use C++ standard library or Boost library.

Tim
  • 1
  • 141
  • 372
  • 590
  • 4
    With 17k+, you surely know that code is far better than poking in the dark :) – Jan Jun 16 '16 at 20:36
  • @Jan I still don't get how these users get such high reputation and can't ask a question. The other day, it was a 32k user -.- – Ismael Miguel Jun 16 '16 at 20:37
  • 2
    @Webeng He MUST know that ANY question requires SOME code. – Ismael Miguel Jun 16 '16 at 20:37
  • 2
    There is no good reason to ask such a question because the answer is most certainly: it depends. What language? The question is currently too broad. – nicael Jun 16 '16 at 20:38
  • 1
    @IsmaelMiguel I missread what Jan was saying, deleted my comment – Webeng Jun 16 '16 at 20:38
  • 2
    regexes would let you specify the whole list of thigns to check, e.g. `/(foo|bar|baz)/` in one single call, but it's still going to boil down to a loop being executed, and likely not to be hugely more efficient (and maybe less so) than doing a bunch of strpos-type calls. – Marc B Jun 16 '16 at 20:39
  • 1
    The next step: define *good way*. Beautiful? Fast? The first is up to you to decide, the latter can be checked. – nicael Jun 16 '16 at 20:40
  • Isn't using a Trie or a SuffixTree the best method to go about what OP wants? – Quirk Jun 16 '16 at 20:41
  • I think the 17k+ points doesn't matter, maybe he earned these point in another topic... also, he answered a lot of other questions, we should be thankful for this and also answer this question professional. – Thomas Jun 16 '16 at 20:46
  • @Thomas this isn't a post which is welcome on SO, at least in this form. – nicael Jun 16 '16 at 20:49
  • 1
    @IsmaelMiguel: No. Not every question requires code. I can't see how code would help this question at all. It would just be noise. – Benjamin Lindley Jun 16 '16 at 20:55
  • 1
    it is a question, about using an algorithm... it is not like: pls program me my program... – Thomas Jun 16 '16 at 20:56
  • @Thomas Are "pls google for me" questions any better? He seems to ask now, research later. – 4castle Jun 16 '16 at 21:04
  • @4castle: I don't want the have trouble here... If you think this question is bad, then down-vote this question otherwise answer or ignore this question. But flaming I think isn't correct (I don't mean you 4castle with this flaming part...) – Thomas Jun 16 '16 at 21:11
  • 2
    Amazingly this was put on _hold_ as too broad. I would post a good answer if those put-on-hold gate keepers would cool their jets a little. The fastest way _IS_ to use a regular expression. Put the list of strings in a regex alternation like this `str1|str2|str3|str4|str5`, run the regex on each input string you get. Why is it faster this way. 1. the regex becomes a _trie_. 2. The source is searched once. The combination represents a 30-200,000 percent _increase_ in performance to _ANY_ other way you can think of... Compile the regex once and you're good to go. –  Jun 16 '16 at 21:54
  • Grab this tool to create the regex from your strings. http://www.regexformat.com/version_files/Rx5_ScrnSht01.jpg http://www.regexformat.com/Dnl/_Samples/_Ternary_Tool%20(Dictionary)/___txt/_ASCII_175,000_word_Mix_A-Z_Multi_Lined.txt –  Jun 16 '16 at 21:59

2 Answers2

2

Definitely overkill. Regex requires compilation of the pattern on construction and you're just doing a linear search with a constant substring. The standard lib has just what you need: http://www.cplusplus.com/reference/string/string/find/

robinkunde
  • 1,005
  • 9
  • 12
  • 2
    A better link is for [`string::find`](http://www.cplusplus.com/reference/string/string/find/) – 4castle Jun 16 '16 at 20:42
  • You're right. I fixed my answer. – robinkunde Jun 16 '16 at 20:44
  • Thanks. Does `string::find` use Boyer-Moore algorithm to find if a string is a substring of another? If not, which algorithm does it use? – Tim Jun 16 '16 at 21:09
  • @Tim See [this question](http://stackoverflow.com/q/8869605/5743988). It appears that the implementation isn't public knowledge, but it isn't the fastest. – 4castle Jun 16 '16 at 21:11
2

In general, no, regex is not a good way to find fixed substrings within a string.

There are several algorithms for substring searching that is faster than naive byte-by-byte searching. The most popular of which is Boyer-Moore. This website lists most of the well known ones including Boyer-Moore and its variations: http://www-igm.univ-mlv.fr/~lecroq/string/index.html.

However, most regex engines actually use Boyer-Moore internally to boost performance (competition among regex engines is actually a thing). So in some cases regexp IS a good way to do it.

But. Since you mention you're using Boost then you should be able to use the boyer_moore_search() that's part of Boost directly without resorting to regex.

Do note however that Boyer-Moore is inefficient if your search string is small. There are other algorithms that beat it for small search strings. So you may want to do some research and compare the algorithms against your own typical search strings. But in general Boyer-Moore is a good bet.

slebetman
  • 109,858
  • 19
  • 140
  • 171