
I am looking to build a smart censor in PHP using regex for a message board. Basically, I have an array of the bad words (as regex patterns) along with the substitution to be used for each. I allow for spaces in between the letters to prevent bypassing the censor, but I'm hung up on someone wrapping any of the bad word's letters in HTML tags. So, if "shit" is blocked, I can catch "s h i t" with any number of spaces, but if someone writes sh<b>i</b>t (with the i wrapped in bold tags), it gets through. That obviously can't happen, so I'm stumped here.

Here is what I have so far:

$bad_words = array('/s\s*h\s*i\s*t/i'=>'s***');
$new_string = preg_replace(array_keys($bad_words), array_values($bad_words), $string);
return $new_string;

I've thought of wrapping $string with strip_tags() but because the rest of the post contents (not just the bad words being sought after) can contain HTML, that will destroy the whole message board post on return. Any help or insight provided would be greatly appreciated!

hwnd
Don't think about this in terms of regex until you can define your rules in English. Exactly which cases are you going to handle? What about substituting the digit 1 for the letter I? Or ! for I? How about $ for S? How about punctuation between letters, like M*A*S*H? Write it out in English and then you can think about code. – Andy Lester Feb 10 '15 at 21:26

1 Answer


The fact is, no matter what you add to catch swear words, if somebody wants to find a way around it, they will. And the more you try to stop it, the more false positives you will get.

Even with your current method, if someone enters "Push it to github", you're going to turn it into "Pus*** to github".
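The false positive is easy to reproduce. Here is a minimal sketch using the pattern from the question. The `censor()` wrapper and the `$gap` variant, which also skips HTML tags between letters, are my own assumptions for illustration and not something from the original post:

```php
<?php
// Hypothetical wrapper around the preg_replace() call from the question.
function censor(string $string, array $bad_words): string
{
    return preg_replace(array_keys($bad_words), array_values($bad_words), $string);
}

// The pattern from the question: optional whitespace between letters.
$spaces_only = array('/s\s*h\s*i\s*t/i' => 's***');

// Assumed variant: also allow HTML tags between letters.
$gap = '(?:\s|<[^>]*>)*';
$with_tags = array('/s' . $gap . 'h' . $gap . 'i' . $gap . 't/i' => 's***');

echo censor('Push it to github', $spaces_only), "\n"; // Pus*** to github
echo censor('sh<b>i</b>t', $spaces_only), "\n";       // sh<b>i</b>t (unchanged)
echo censor('sh<b>i</b>t', $with_tags), "\n";         // s***
echo censor('Push it to github', $with_tags), "\n";   // Pus*** to github
```

Loosening the pattern catches the bold-tag trick but keeps the same false positive, and a match that swallows only an opening tag (for example `s<b>hit</b>`) leaves a dangling `</b>` behind.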

Honestly, your best bet is to catch the obvious ones, and have a way to flag a post as obscene.
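As a rough sketch of that approach: I'm assuming "obvious" means whole-word matches (hence the `\b` anchors), and I'm using the replacement count as an automatic flag for moderators, which is only one way to read "flag"; a user-facing report button would be a separate feature. The `censor_and_flag()` name and the returned array shape are hypothetical:

```php
<?php
// Hypothetical helper: censor whole-word matches only and report whether
// anything was replaced, so the post can be queued for moderator review.
function censor_and_flag(string $string, array $bad_words): array
{
    $count = 0;
    $clean = preg_replace(array_keys($bad_words), array_values($bad_words), $string, -1, $count);
    return array('text' => $clean, 'flagged' => $count > 0);
}

// \b anchors the match at word boundaries, so "Push it" is left alone.
$bad_words = array('/\bs\s*h\s*i\s*t\b/i' => 's***');

$clean_post = censor_and_flag('Push it to github', $bad_words);
// $clean_post['text'] is 'Push it to github', $clean_post['flagged'] is false

$rude_post = censor_and_flag('well, s h i t', $bad_words);
// $rude_post['text'] is 'well, s***', $rude_post['flagged'] is true
```

This trades coverage for fewer false positives: embedded or tag-wrapped variants slip through, which is where the flagging comes in.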

Some good resources to look at on this site are:

How do you implement a good profanity filter?

and

"bad words" filter

dave