php preg_match_all() 70 times for each word | Api endpoint | performance

Question

I have a list of 70 words. This list is used to check user input. The user input is a text, which has on average 30-100 words. If one of the words from my list is in the text then the user text is removed, otherwise it is allowed. In most cases it will be allowed, so it will loop through all words.

To check whether the words are in the user text I use:

$susWords = SuspiciousWord::where('checked', true)->get();

$foundSusWord = false;
foreach ($susWords as $word) {
    if (preg_match_all("/" . $word->word . "/i", $user->flirttext)) {
        $foundSusWord = true;
        break;
    }
}

I am not an expert when it comes to regex and performance. Could performance be an issue here?

Why regex and not `stripos() !== false`? – Justinas Aug 19 '20 at 13:19

score 2 · Accepted Answer · answered Aug 19 '20 at 13:23

Use `stripos($user->flirttext, $word->word) !== false` for a faster check, since plain words do not need a regex.
Or use `preg_match('/\b(' . implode('|', array_column($susWords, 'word')) . ')\b/i', $user->flirttext)` to check for all words at once (the `i` flag keeps the match case-insensitive, as in the question).
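
The two suggestions above can be sketched as follows. This is a minimal illustration, not the answerer's exact code: it assumes the word list is already a plain array of strings, and the function names `containsAnyWord` and `containsAnyWordRegex` as well as the sample words are hypothetical. It also adds `preg_quote()` so that literal words containing regex metacharacters do not break the combined pattern.

```php
<?php
// Hypothetical stand-ins for the word list and the user text.
$words = ['spam', 'scam', 'casino'];
$text  = 'Hello, this is a perfectly normal SCAM-free message.';

// Approach 1: case-insensitive substring check, stopping at the first hit.
function containsAnyWord(string $text, array $words): bool
{
    foreach ($words as $word) {
        if (stripos($text, $word) !== false) {
            return true;
        }
    }
    return false;
}

// Approach 2: one alternation regex for the whole list.
// preg_quote() escapes regex metacharacters in literal words;
// its second argument also escapes the "/" delimiter.
function containsAnyWordRegex(string $text, array $words): bool
{
    $quoted = array_map(function ($w) {
        return preg_quote($w, '/');
    }, $words);
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';

    return preg_match($pattern, $text) === 1;
}

var_dump(containsAnyWord($text, $words));      // bool(true): "SCAM" matches "scam"
var_dump(containsAnyWordRegex($text, $words)); // bool(true)
```

Note that the two functions are not equivalent: the `stripos()` version matches substrings, while the regex version with `\b` matches whole words only.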

answered Aug 19 '20 at 13:23

Justinas

which of these is faster? two of my words are patterns, here is an example \sig\s. I could use the first approach for all non pattern words and the regex approach for patterns. – Roman Aug 19 '20 at 13:36
@Roman `\sig\s` is covered with `\big\b`. Regex is _always_ slower than simple string operations – Justinas Aug 19 '20 at 13:37
Yes, I understand, but should I just use your second example for all words or shall I split it into two functions. Would it be worth it? I have only 2 words with patterns – Roman Aug 19 '20 at 13:38
The second example has errors: it won't find the word if it is part of a longer string. For example, the word `sex` will not be found if the user input is `sexy`, which is bad; my original regex found it. – Roman Aug 19 '20 at 13:49
@Roman Well, you did not provide correct/incorrect examples. If it can be any substring of words, then remove `\b`. In that case your `\sig\s` would still work. `\b` means word boundary – Justinas Aug 19 '20 at 14:06
Thanks, works great now. I am still not sure whether I should split my word list into two lists: a pattern list and a word list. I would use the first approach for the word list and the second approach for the pattern list. Would it be worth it for performance? – Roman Aug 19 '20 at 14:15
Since you are joining via `|`, then there is no point in making two lists – Justinas Aug 20 '20 at 05:29
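
The comment thread above can be illustrated with a short sketch. It assumes a hypothetical split into literal words (escaped with `preg_quote()`) and raw regex fragments such as `\sig\s` (used as-is), both joined into a single alternation; the sample words and variable names are made up for the example.

```php
<?php
// Hypothetical lists: literal words get escaped, the regex fragment is used as-is.
$literalWords = ['sex', 'viagra'];
$rawPatterns  = ['\sig\s'];

$quoted = array_map(function ($w) {
    return preg_quote($w, '/');
}, $literalWords);

// Substring matching: no \b, so "sex" is also found inside "sexy".
$substringPattern = '/(?:' . implode('|', array_merge($quoted, $rawPatterns)) . ')/i';

// Whole-word matching for comparison: \b requires a word boundary on both sides.
$wholeWordPattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';

var_dump(preg_match($wholeWordPattern, 'that is sexy'));        // int(0)
var_dump(preg_match($substringPattern, 'that is sexy'));        // int(1)
var_dump(preg_match($substringPattern, 'add me on ig please')); // int(1)
var_dump(preg_match($substringPattern, 'a big dog'));           // int(0)
```

Because everything ends up in one pattern, a single `preg_match()` call covers both the plain words and the patterns.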

score 0 · Answer 2 · answered Aug 19 '20 at 13:19

You can use strpos()

https://www.php.net/manual/en/function.strpos.php

Much more efficient than regex.

A benchmark is here: https://stackoverflow.com/a/6433599/9470935
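
Two details are worth keeping in mind when swapping the regex for `strpos()`; the snippet below is only a sketch with made-up sample strings. First, `strpos()` is case-sensitive, whereas the regex in the question used the `i` flag, so `stripos()` is the closer replacement. Second, both functions return the match position, which can be `0`, so the result has to be compared with `!== false`.

```php
<?php
$text = 'Totally a SCAM';

// strpos() is case-sensitive, stripos() is not.
var_dump(strpos($text, 'scam') !== false);  // bool(false)
var_dump(stripos($text, 'scam') !== false); // bool(true)

// A match at the very start returns int(0), which is falsy in a loose check.
var_dump(stripos('scam alert', 'scam'));           // int(0)
var_dump(stripos('scam alert', 'scam') !== false); // bool(true)
```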

answered Aug 19 '20 at 13:19

pjplonka

score -1 · Answer 3 · answered Aug 19 '20 at 13:32

EDIT: As pointed out by @Justinas, this method does not work well when the text contains punctuation and should not be used in that case. I am leaving it here as a reference.

You can also use `array_intersect` to avoid loops:

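// Split the text on spaces; punctuation stays attached to the words.
$wordlist = explode(' ', $user->flirttext);
// Assumes $susWords is a plain array of strings.
if (count(array_intersect($susWords, $wordlist)) > 0) {
    // found a bad word, do something
}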

see doc here
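
One way to soften the punctuation problem is to tokenize on non-alphanumeric characters instead of spaces. The following is a sketch only, not part of the original answer: `containsListedWord` is a hypothetical helper, the word list is assumed to be a plain array of strings, and `strtolower()` is assumed to be good enough for case folding (it only handles ASCII reliably). It still matches whole words only, so `sex` would not be found inside `sexy`.

```php
<?php
function containsListedWord(string $text, array $words): bool
{
    // Split on runs of characters that are neither letters nor digits,
    // so "hello.???" yields the token "hello".
    $tokens = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $words  = array_map('strtolower', $words);

    return count(array_intersect($words, $tokens)) > 0;
}

$susWords = ['hello', 'spam'];

// Plain explode() keeps the punctuation attached, so the word is missed.
var_dump(in_array('hello', explode(' ', 'foo bar hello.???'), true)); // bool(false)

// Tokenizing on non-alphanumerics finds it.
var_dump(containsListedWord('foo bar hello.???', $susWords)); // bool(true)
```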

edited Aug 19 '20 at 13:45

answered Aug 19 '20 at 13:32

akabaka

What if you search for `hello` and user has entered `foo bar hello.???`? – Justinas Aug 19 '20 at 13:35
You're right, indeed. It would need a lot more parsing, with the possibility of missing some other characters, so it is not a really good way. I'll leave the answer for others to see so they don't make the same mistake, but I have edited it to warn people. – akabaka Aug 19 '20 at 13:41
