
I need to detect the presence of certain words (including multi-word expressions, such as "bag of words") in a user-submitted string.

I need to match whole words, not substrings, so the strstr/strpos/stripos family is not an option for me.

My current approach (PHP/PCRE regex) is the following:

\b(first word|second word|many other words)\b

Is there a better approach? Am I missing something important?

There are about 1,500 words.

Any help is appreciated.

Robert P
Life after Guest

1 Answer

A regular expression like the one you're showing will work, including for phrases that contain spaces, as long as the list doesn't grow much. It may become challenging to maintain if the list of words grows long or changes often.
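To keep the maintenance burden down, the pattern can be generated from a plain list rather than written by hand. Here is a minimal sketch of that idea; the names `buildWordPattern` and `findWords` are made up for illustration, and the `/i` and `/u` modifiers are my assumption about what you'd want (case-insensitive matching on UTF-8 input):

```php
<?php
// Hypothetical helper: builds a single whole-word alternation from a list of phrases.
function buildWordPattern(array $words): string
{
    // Try longer phrases first, so "bag of words" wins over "bag"
    // when both are in the list (alternation takes the first branch that fits).
    usort($words, fn(string $a, string $b): int => strlen($b) <=> strlen($a));

    // Escape regex metacharacters in every phrase; '/' is the delimiter used below.
    $quoted = array_map(fn(string $w): string => preg_quote($w, '/'), $words);

    // i = case-insensitive, u = treat pattern and subject as UTF-8.
    return '/\b(?:' . implode('|', $quoted) . ')\b/iu';
}

// Hypothetical helper: returns every listed phrase found in $text, as written in $text.
function findWords(string $text, array $words): array
{
    preg_match_all(buildWordPattern($words), $text, $matches);
    return $matches[0];
}
```

With this, `findWords('I carry a bag of words in my handbag', ['bag', 'bag of words'])` reports the phrase "bag of words" and ignores the "bag" inside "handbag". The group is non-capturing (`?:`) because only the full match is needed.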

If none of the words you're looking for contain spaces, you could instead split the input string on whitespace (\s+, see https://www.php.net/manual/en/function.preg-split.php), then check whether any of the resulting tokens are in a Set (https://www.php.net/manual/en/class.ds-set.php) made up of the words you're looking for. This is a bit more code but less regex maintenance, so your mileage may vary depending on your application.
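A rough sketch of the split-and-look-up idea follows. Two things here are my own substitutions rather than anything required: it uses plain array keys as the set (so it runs without the `ds` extension that `Ds\Set` needs), and it splits on runs of non-letter, non-digit characters instead of only `\s+`, so that trailing punctuation doesn't stick to a token. It also assumes the `mbstring` extension is available for `mb_strtolower`. The function name `findSingleWords` is hypothetical.

```php
<?php
// Hypothetical helper: returns the watch-list words that occur as whole tokens in $text.
function findSingleWords(string $text, array $watchList): array
{
    // Lowercased words become array keys, which gives hash-based membership checks.
    $set = array_fill_keys(array_map('mb_strtolower', $watchList), true);

    // Split on anything that is not a letter or digit; drop empty pieces.
    $tokens = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $found = [];
    foreach ($tokens as $token) {
        if (isset($set[$token])) {
            $found[$token] = true; // keyed by token, so each hit is reported once
        }
    }

    // Keys that look like integers come back as ints, so cast them to strings again.
    return array_map('strval', array_keys($found));
}
```

For example, `findSingleWords('Alice met Bob, then alice left.', ['alice', 'carol'])` reports "alice" once. Each token costs one hash lookup, so the running time depends on the length of the input rather than on the size of the word list.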

If the phrases do contain spaces, consider using a trie instead. Wiktor Stribiżew suggests: https://github.com/sters/php-regexp-trie
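To give an idea of what a trie buys you, here is a small hand-rolled sketch. It is not the API of the linked library (which builds a regex), but a simpler variant I'm swapping in for illustration: a trie keyed by whole tokens, walked once per starting position in the input. All function names are hypothetical, and it assumes the `mbstring` extension for `mb_strtolower`.

```php
<?php
// Lowercases $text and splits it into letter/digit tokens.
function tokenize(string $text): array
{
    return preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

// Builds a nested array: each node has optional 'next' (children by token)
// and optional 'end' (the phrase that finishes at this node).
function buildPhraseTrie(array $phrases): array
{
    $trie = [];
    foreach ($phrases as $phrase) {
        $tokens = tokenize($phrase);
        if ($tokens === []) {
            continue; // nothing to index
        }
        $node = &$trie;
        foreach ($tokens as $token) {
            $node['next'][$token] ??= [];
            $node = &$node['next'][$token];
        }
        $node['end'] = mb_strtolower($phrase);
        unset($node); // drop the reference before the next phrase
    }
    return $trie;
}

// Reports the longest listed phrase that starts at each token position.
function findPhrases(string $text, array $trie): array
{
    $tokens = tokenize($text);
    $count = count($tokens);
    $found = [];
    for ($start = 0; $start < $count; $start++) {
        $node = $trie;
        $longest = null;
        for ($i = $start; $i < $count && isset($node['next'][$tokens[$i]]); $i++) {
            $node = $node['next'][$tokens[$i]];
            if (isset($node['end'])) {
                $longest = $node['end'];
            }
        }
        if ($longest !== null) {
            $found[] = $longest;
        }
    }
    return $found;
}
```

Usage would look like `findPhrases($input, buildPhraseTrie($phrases))`, building the trie once and reusing it. Phrases sharing a first token share a path, so the walk from any position is bounded by the longest phrase rather than by the number of phrases.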

Robert P
  • Thanks for your kind reply. I wasn't so much worried about maintenance as about performance. There are always spaces; often the strings are long surnames (e.g. Spanish surnames, sànchez y gonzalez, etc.) – Life after Guest Feb 21 '20 at 18:16
  • It will not work if you have a hundred thousand words. – Wiktor Stribiżew Feb 21 '20 at 18:18
  • @WiktorStribiżew At the moment there are 1,500. They could eventually grow to no more than 2,000, so luckily nowhere near a hundred thousand. – Life after Guest Feb 21 '20 at 18:31
  • OK, I really exaggerated, but even if you have a hundred words that start with the same prefix, backtracking can kill performance. In my real work, I have had to deal with dictionaries of up to 50K terms, and a mere regex like `\b(?:w1|w2|\wn)\b` did not work as it was far too slow. – Wiktor Stribiżew Feb 21 '20 at 22:03
  • @WiktorStribiżew: In my case the words are very different from one another, being basically the first names and surnames of employees from all over the world. I appreciate your comments, thanks! – Life after Guest Feb 22 '20 at 07:10