I am looking to build a smart censor in PHP using regex for a message board. Basically, I have an array of the bad words (as regex patterns) along with the substitution to be used for each. I allow for spaces between the letters to prevent bypassing the censor, but I'm hung up on someone wrapping any of a bad word's letters in HTML tags. So, if "shit" is blocked, I can catch "s h i t" with any number of spaces, but if someone writes sh<b>i</b>t (with the i wrapped in bold tags), it gets through. That obviously can't happen, so I'm stumped here.
Here is what I have so far:
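// Keys are the bad-word patterns (whitespace allowed between letters); values are the replacements.
$bad_words = array('/s\s*h\s*i\s*t/i' => 's***');
$new_string = preg_replace(array_keys($bad_words), array_values($bad_words), $string);
return $new_string;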
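To make the failure concrete, here is a minimal reproduction. The wrapper name `censor_post` is hypothetical (the post does not show the enclosing function); the pattern and replacement are the ones above.

```php
<?php
// Hypothetical wrapper around the snippet from the post.
function censor_post(string $string): string
{
    $bad_words = array('/s\s*h\s*i\s*t/i' => 's***');
    return preg_replace(array_keys($bad_words), array_values($bad_words), $string);
}

echo censor_post('s h   i t'), "\n";    // prints "s***" (spaces are handled)
echo censor_post('sh<b>i</b>t'), "\n";  // prints "sh<b>i</b>t" (tags slip through)
```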
I've thought of running $string through strip_tags() first, but because the rest of the post (not just the bad words being sought) can contain HTML, that would strip the legitimate formatting from the whole post on return. Any help or insight would be greatly appreciated!
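One possible direction, offered as a rough sketch rather than a tested solution: instead of stripping tags from the whole post, let the pattern itself tolerate tags between the letters, the same way it already tolerates spaces. The helper names `build_censor_pattern` and `censor_html` are hypothetical, and the sketch assumes single-byte (ASCII) words and treats `<[^>]*>` as a crude approximation of an HTML tag. Tags swallowed by a match are re-appended after the replacement so that the surrounding markup stays balanced.

```php
<?php
// Hypothetical helper: builds a pattern from a plain word that allows
// any run of whitespace or HTML tags between consecutive letters.
function build_censor_pattern(string $word): string
{
    $gap = '(?:\s|<[^>]*>)*';
    $letters = array_map(
        function ($ch) { return preg_quote($ch, '/'); },
        str_split($word) // assumption: ASCII-only words
    );
    return '/' . implode($gap, $letters) . '/i';
}

// Hypothetical helper: replaces each bad word and keeps any tags that
// were inside the match, so open/close tags are not orphaned.
function censor_html(string $string, array $bad_words): string
{
    foreach ($bad_words as $word => $replacement) {
        $string = preg_replace_callback(
            build_censor_pattern($word),
            function ($m) use ($replacement) {
                preg_match_all('/<[^>]*>/', $m[0], $tags);
                return $replacement . implode('', $tags[0]);
            },
            $string
        );
    }
    return $string;
}

$bad_words = array('shit' => 's***');
echo censor_html('sh<b>i</b>t happens', $bad_words), "\n"; // prints "s***<b></b> happens"
echo censor_html('<b>shi</b>t', $bad_words), "\n";         // prints "<b>s***</b>"
```

The rest of the post's HTML is left alone because only the matched span is rewritten. A regex cannot parse HTML reliably in general (for example, a `>` inside an attribute value would confuse the tag approximation), so this should be read as an illustration of the idea, not a complete filter.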