I have a bad words filter that uses a list of keywords saved in a local UTF-8 encoded file. This file includes both Latin and non-Latin chars (mostly English and Arabic). Everything works as expected with Latin keywords, but when the variable includes non-Latin chars, the matching does not seem to recognize these existing keywords.

How do I go about matching both Latin and non-Latin keywords?

The badwords.txt file includes one word per line, as in this example:

bad

nasty

racist

سفالة

وساخة

جنس

Code used for matching:

$badwords = file_get_contents("badwords.txt");
$badtemp = explode("\n", $badwords);
$badwords = array_unique($badtemp);
$badFlag = 0;
$query = strtolower($query);

foreach ($badwords as $key => $val) {
    if (!empty($val)) {
        $val = trim($val);
        $regexp = "/\b" . $val . "\b/i";
        if (preg_match($regexp, $query))
            $badFlag = 1;

        if ($badFlag == 1) {
           // Bad word detected die...
        }
    }
}

I've read that iconv, the multibyte functions (mbstring), and the /u modifier might help with this. I tried a few things but cannot seem to get it right. Any help in resolving this, so that it matches both Latin and non-Latin keywords, would be much appreciated.

Yallaa

2 Answers

The problem seems to relate to recognizing word boundaries; the \b construct is apparently not “Unicode aware.” This is what the answers to the question “php regex word boundary matching in utf-8” seem to suggest. I was able to reproduce the problem with \b even on text containing Latin letters like “é”. The problem seems to disappear (i.e., Arabic words are correctly recognized) when I set

$wstart = '(^|[^\p{L}])';
$wend = '([^\p{L}]|$)';

and modify the regexp as follows:

$regexp = "/" . $wstart . $val . $wend . "/iu";
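Putting those pieces together, a minimal sketch of the whole check might look like the following. The function name containsBadWord is hypothetical, and the call to preg_quote() is an addition that is not in the original code; it is there so that a keyword containing regex metacharacters is matched literally. The inline word list stands in for the contents of badwords.txt.

```php
<?php
// Hypothetical helper: true if $query contains any keyword as a whole word.
function containsBadWord(string $query, array $badwords): bool
{
    // A word edge is "start/end of string, or any character that is not a letter".
    $wstart = '(^|[^\p{L}])';
    $wend   = '([^\p{L}]|$)';

    foreach (array_unique($badwords) as $val) {
        $val = trim($val);
        if ($val === '') {
            continue; // skip blank lines from the word list
        }
        // preg_quote() escapes regex metacharacters in the keyword (an addition).
        $regexp = '/' . $wstart . preg_quote($val, '/') . $wend . '/iu';
        if (preg_match($regexp, $query) === 1) {
            return true;
        }
    }
    return false;
}

// Stand-in for the lines of badwords.txt (one keyword per line in the question).
$badwords = ['bad', 'nasty', 'سفالة', 'جنس'];

var_dump(containsBadWord('هذا جنس آخر', $badwords)); // bool(true)
var_dump(containsBadWord('badminton', $badwords));   // bool(false)
```

The /i flag makes a separate strtolower() call on the query unnecessary in this sketch, and the /u flag makes both the pattern and the subject be treated as UTF-8.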
Jukka K. Korpela
  • Thank you Jukka, this is exactly what I needed; it finally works. I would not have guessed that the problem was what it turned out to be. The boundaries regexp is actually the one thing that always stayed constant while I was testing various suggestions. Thanks a lot. – Yallaa Dec 26 '11 at 22:29

Some string functions in PHP are not safe to use on UTF-8 strings. This was supposedly going to be fixed in version 6, but for now you need to be careful what you do with a string.

It looks like strtolower() is one of them; you need to use mb_strtolower($query, 'UTF-8') instead. If that doesn't fix it, you'll need to read through the code, find every point where you process $query or badwords.txt, and check the documentation of each function for UTF-8 bugs.
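As a rough illustration of that replacement, the sketch below lowercases a mixed Latin and Arabic string. It assumes the mbstring extension is enabled, and the sample string is made up for the example.

```php
<?php
// Assumes the mbstring extension is enabled; the sample text is made up.
$query = 'ÉCOLE Nasty سفالة';

// mb_strtolower() is told the encoding explicitly, so multibyte letters
// such as "É" are lowercased and the Arabic text is left intact.
$normalized = mb_strtolower($query, 'UTF-8');

echo $normalized, "\n"; // école nasty سفالة
```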

As far as I know, preg_match() is ok with UTF-8 strings, but there are some features disabled by default to improve performance. I don't think you need any of them.

Please also double-check that badwords.txt is a UTF-8 file and that $query contains a valid UTF-8 string (if it's coming from the browser, the page's encoding is declared with a <meta> tag).
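One possible way to run that check in code is mb_check_encoding(), sketched below. The helper name isValidUtf8 is hypothetical and the mbstring extension is assumed to be enabled; the same check could be applied to both $query and the contents of badwords.txt.

```php
<?php
// Hypothetical helper; assumes the mbstring extension is enabled.
function isValidUtf8(string $text): bool
{
    return mb_check_encoding($text, 'UTF-8');
}

var_dump(isValidUtf8('جنس'));      // bool(true)
var_dump(isValidUtf8("\xE9cole")); // bool(false): a lone Latin-1 byte, not valid UTF-8
```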

If you're trying to debug UTF-8 text, remember most web browsers do not default to the UTF-8 text encoding, so any PHP variable you print out for debugging will not be displayed correctly by the browser, unless you select UTF-8 (in my browser, with View -> Encoding -> Unicode).
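One way to take the browser out of the picture while debugging, offered only as a sketch, is to print the raw bytes of a value with bin2hex(). The variable below is a stand-in for a keyword or for $query.

```php
<?php
// Print the raw bytes of a value so the result does not depend on
// how the browser decodes the page.
$val = "\xC3\xA9"; // the two UTF-8 bytes of "é"; stand-in for a keyword or $query

echo bin2hex($val), "\n"; // c3a9
```

If the same character shows up as a single byte (e9) instead, the string is probably in a Latin-1 style encoding rather than UTF-8.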

You shouldn't need to use iconv or any of the other conversion APIs; most of them will simply replace all of the non-Latin characters with Latin ones, which is obviously not what you want.

Abhi Beckert
  • Thank you Abhi for responding. The file is indeed saved as UTF-8, and the query comes from a UTF-8 encoded page that uses <meta charset="utf-8">. I've used mb_strtolower() before, along with mb_ereg_match(), which still matches English keywords but not Arabic. This is not related to the browser default language; the goal is merely to detect whether the queried keyword exists in the badwords.txt file and then do further processing. All presentation pages are UTF-8 encoded. Any further ideas would be appreciated. Thanks – Yallaa Dec 25 '11 at 23:27