I found this question and am working off of it, but I need to extend it a little further: "Check if string contains word in array".

I am trying to create a script that checks a webpage for known bad words. I have an array of bad words, and the script compares each one against the string returned by file_get_contents.

This works at a basic level, but returns false positives. For example, if I am loading a webpage with the word "title" it returns that it found the word "tit".

Is my best bet to strip all HTML and punctuation, then explode the text on spaces and put each individual word into an array? I am hoping there is a more efficient process than that.

Here is my code so far:

$url = 'http://somewebsite.com/';
$content = strip_tags(file_get_contents($url));

//list of bad words separated by commas
$badwords = 'tit,butt,etc'; //this will eventually come from a db
$badwordList = explode(',', $badwords);

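$foundWords = []; // initialise so print_r() works even when nothing is found
foreach ($badwordList as $bad) {
    // strpos() returns false when there is no match and 0 for a match at
    // the very start, so compare strictly against false instead of empty().
    if (strpos($content, $bad) !== false) {
        $foundWords[] = $bad;
    }
}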

print_r($foundWords);

Thanks in advance!

Developer Gee

1 Answer

You can just use a regex with preg_match_all():

$badwords = 'tit,butt,etc'; 
$regex = sprintf('/\b(%s)\b/', implode('|', explode(',', $badwords)));

if (preg_match_all($regex, $content, $matches)) {
    print_r($matches[1]);
}

The second statement builds the regex used to match and capture the listed words in the page content. It first splits the $badwords string on commas, then joins the pieces with |. The resulting pattern is /\b(tit|butt|etc)\b/. The \b (a word boundary) ensures that only whole words are matched.

This pattern matches any of those words, and the words found in the webpage are stored in the array $matches[1].
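
If the word list will eventually come from a database, it may be worth hardening the pattern a little. The following is only a sketch under some assumptions: the helper name buildBadWordRegex and the sample $content string are hypothetical, the list is assumed to contain at least one non-empty word, and case-insensitive matching is assumed to be wanted. It escapes each word with preg_quote() so that regex metacharacters in a word cannot break the pattern, and it removes duplicate hits.

```php
<?php
// Hypothetical helper (not part of the original answer): builds a
// whole-word, case-insensitive pattern from a comma-separated word list.
// Assumes the list contains at least one non-empty word.
function buildBadWordRegex(string $badwords): string
{
    // Split on commas, trim whitespace, and drop empty entries.
    $words = array_filter(array_map('trim', explode(',', $badwords)), 'strlen');

    // Escape regex metacharacters (and the "/" delimiter) in each word.
    $escaped = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $words);

    // The "i" modifier makes the match case-insensitive.
    return '/\b(' . implode('|', $escaped) . ')\b/i';
}

// Sample text standing in for strip_tags(file_get_contents($url)).
$content = 'The page title mentions a Butt joint, etc.';

$regex = buildBadWordRegex('tit,butt,etc');
$foundWords = [];
if (preg_match_all($regex, $content, $matches)) {
    // Lowercase the hits and remove duplicates.
    $foundWords = array_values(array_unique(array_map('strtolower', $matches[1])));
}

print_r($foundWords);
```

With this sample text, "title" is not reported because no word boundary follows "tit" inside it, while "Butt" and "etc" are both found.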

Amal Murali
  • Now if you can do me one more favor. The first code you posted still returned the false positive, but your update fixed it. Can you explain what the \b does? For the life of me I cannot wrap my head around regex. – Developer Gee Nov 03 '14 at 19:14
  • @DeveloperGee: As I mention in the answer, `\b` asserts the position at a word boundary; basically anywhere between a word-character (letters, numbers etc.) and a non-word character (everything else). For more information, see http://www.regular-expressions.info/wordboundaries.html – Amal Murali Nov 03 '14 at 19:18
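
To make the word-boundary behaviour described in the comments concrete, here is a small sketch. The sample strings are made up for illustration. Note that the underscore counts as a word character, so "tit_mouse" does not match, while "tit-mouse" does.

```php
<?php
// Sample strings chosen for illustration only.
$samples = ['title', 'tit', 'a tit.', 'tit_mouse', 'tit-mouse'];

$results = [];
foreach ($samples as $sample) {
    // preg_match() returns 1 when the pattern matches and 0 when it does not.
    $results[$sample] = preg_match('/\btit\b/', $sample) === 1;
}

var_export($results);
```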