I have a bad-words filter that reads its keywords from a local UTF-8 encoded file. The file contains both Latin and non-Latin characters (mostly English and Arabic). Matching works as expected for Latin keywords, but when the input contains non-Latin characters, keywords that are in the list are not recognized.
How do I go about matching both Latin and non-Latin keywords?
The badwords.txt file contains one word per line, as in this example:
bad
nasty
racist
سفالة
وساخة
جنس
Code used for matching:
// $query holds the user-supplied text to check.
$badwords = file_get_contents("badwords.txt");
$badtemp = explode("\n", $badwords);
$badwords = array_unique($badtemp);
$badFlag = 0;
$query = strtolower($query);
foreach ($badwords as $key => $val) {
    if (!empty($val)) {
        $val = trim($val);
        // Whole-word, case-insensitive match for the current keyword.
        $regexp = "/\b" . $val . "\b/i";
        if (preg_match($regexp, $query))
            $badFlag = 1;
        if ($badFlag == 1) {
            // Bad word detected die...
        }
    }
}
I've read that iconv, the multibyte string functions (mbstring), and the /u pattern modifier might help with this. I tried a few things but can't seem to get it right. Any help getting this to match both Latin and non-Latin keywords would be much appreciated.
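For reference, here is a minimal sketch of the direction I understand the /u suggestion to point in. I have not confirmed it is the right fix. It assumes the input and the word list are both valid UTF-8 and that PCRE was built with Unicode support. The function name containsBadWord is hypothetical, and the explicit \p{L}-based lookarounds are my own stand-in for \b, chosen so the word boundaries do not depend on how \b treats non-ASCII letters.

```php
<?php
// Hypothetical helper (not part of my original code): returns true if any
// keyword occurs in $text as a whole word. Assumes UTF-8 input throughout.
function containsBadWord(string $text, array $badwords): bool
{
    foreach ($badwords as $word) {
        $word = trim($word);
        if ($word === '') {
            continue; // skip blank lines from the word list
        }
        // /u makes PCRE treat pattern and subject as UTF-8; /i handles case,
        // so the byte-oriented strtolower() call is skipped here.
        // The lookarounds require that no letter, combining mark, digit or
        // underscore touches the keyword on either side.
        $pattern = '/(?<![\p{L}\p{M}\p{N}_])'
                 . preg_quote($word, '/')
                 . '(?![\p{L}\p{M}\p{N}_])/iu';
        // preg_match() returns false on invalid UTF-8, so compare against 1.
        if (preg_match($pattern, $text) === 1) {
            return true;
        }
    }
    return false;
}

// Inline list for illustration; the real list would come from badwords.txt.
$words = ['bad', 'nasty', 'جنس'];
var_dump(containsBadWord('This is NASTY', $words)); // bool(true)
var_dump(containsBadWord('هذا جنس', $words));       // bool(true)
var_dump(containsBadWord('badminton', $words));     // bool(false)
```

preg_quote() is there so that a keyword containing regex metacharacters cannot break the pattern. If badwords.txt was saved with a UTF-8 byte order mark, I assume the first keyword would need the BOM stripped before matching.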