echo preg_match("/\b(בדיקה|מילה)\b/iu", "זוהי בדיקה");

For some reason, this code returns 1 on several servers I've tested it on, but 0 on one specific server.

PCRE is compiled with UTF-8 support and Unicode properties support. What could be the issue?

ThiefMaster
Lior
  • What is the `locale` on this server? What is the web server's default charset? Check the response encoding in the headers with Firebug. – ZiTAL Apr 08 '12 at 15:24
  • @ZiTAL It's the same as on the other servers: Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3 – Lior Apr 08 '12 at 15:38
  • What is the output of this? `print_r(mb_detect_order());` – ZiTAL Apr 08 '12 at 15:54
  • Make sure the file is encoded in UTF-8; some file copy programs change encodings. Otherwise, as @ZiTAL hinted, the locale might play a role: http://www.php.net/manual/en/class.locale.php – j13r Apr 08 '12 at 16:32
  • @ZiTAL Array ( [0] => ASCII [1] => UTF-8 ) on all of the servers. Any other ideas? This is really frustrating. Also, the file is encoded in UTF-8, of course. – Lior Apr 10 '12 at 01:21
  • Print the result of `locale -a` on the server. – ZiTAL Apr 10 '12 at 07:08
  • Is the PHP version the same? Regular expressions aren't easy when working with Unicode. To quote http://www.regular-expressions.info/php.html: "you should specify /u for regular expressions that use \x{FFFF}, \X or \p{L} to match Unicode characters, graphemes, properties or scripts. PHP will interpret '/regex/u' as a UTF-8 string rather than as an ASCII string." So just writing a multibyte string might not be enough, and you might need to convert it into \x format. – Artjom Kurapov Apr 10 '12 at 08:09
  • Can you run this example with php-cli instead of through the web server? I think this is a web server encoding problem. – ZiTAL Apr 10 '12 at 08:51
  • This comes down to changes in the PCRE library: earlier, `\b` was not UTF-8/Unicode aware; nowadays it works with UTF-8/Unicode as well (same for `\w` etc.), as the two answers already say. – hakre Apr 12 '12 at 00:05
  • I have no answer, but whoever solves this, please have a look here, as he/she can get more bounty for a problem that looks very similar... http://stackoverflow.com/questions/9741240/utf8-problems-with-in-php-in-solaris – eillarra Apr 13 '12 at 19:14

2 Answers

There may be a difference between the PCRE versions that PHP uses on these servers.

PHP and PCRE versions: http://php.net/pcre.installation

You should use PCRE 8.10+ (PHP 5.3.4+).

Version 8.10 25-Jun-2010:

  1. Added PCRE_UCP to make \b, \d, \s, \w, and certain POSIX character classes use Unicode properties. (*UCP) at the start of a pattern can be used to set this option. Modified pcretest to add /W to test this facility. Added REG_UCP to make it available via the POSIX interface.

Edit: I just ran some tests: it returns 1 on PHP 5.3.10 and 0 on PHP 5.3.2 and PHP 5.3.3.
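To see which side of the 8.10 line a given server falls on, a quick check along these lines may help. It is only a sketch: `pcre_has_ucp` is a hypothetical helper name, and it assumes `PCRE_VERSION` has the usual "number, space, date" form.

```php
<?php
// Hypothetical helper: true when the given PCRE version string is 8.10 or
// newer, the release whose changelog entry above introduces PCRE_UCP.
function pcre_has_ucp($pcreVersion)
{
    // PCRE_VERSION usually looks like "8.02 2010-03-19"; keep only the number.
    $parts = explode(' ', trim($pcreVersion));
    return version_compare($parts[0], '8.10', '>=');
}

// Report what this server is actually running.
echo 'PHP ', PHP_VERSION, ', PCRE ', PCRE_VERSION, "\n";
echo pcre_has_ucp(PCRE_VERSION)
    ? "PCRE is 8.10 or newer: \\b can use Unicode properties\n"
    : "PCRE is older than 8.10: \\b only knows ASCII word characters\n";
```

Running it on each server (both through the web server and with php-cli) should show whether the odd one out links against an older PCRE.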

Naki

It might depend on the version of the PCRE library. To make the behavior consistent, try using the (*UCP) verb: preg_match('/(*UCP)\b(בדיקה|מילה)\b/iu', 'זוהי בדיקה').

This still requires PCRE 8.10, which has shipped with PHP since 5.3.4, or an external PCRE of that version selected with the compile flag --with-pcre-regex=DIR.

Ref (in russian)
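If upgrading PCRE is not an option, one possible workaround (not part of the answers above, so treat it as an assumption) is to avoid `\b` altogether and spell the boundary out with lookarounds on Unicode properties. This is a sketch: `contains_whole_word` is a hypothetical helper, and it relies on PCRE having Unicode properties support, which the question says is compiled in.

```php
<?php
// Hypothetical helper: does $text contain one of $words as a whole word?
function contains_whole_word($text, array $words)
{
    // Escape each word so it is matched literally inside the pattern.
    $alternatives = implode('|', array_map(function ($word) {
        return preg_quote($word, '/');
    }, $words));

    // Lookarounds stand in for \b: no letter, digit or underscore may
    // directly precede or follow the matched word.
    $pattern = '/(?<![\p{L}\p{N}_])(?:' . $alternatives . ')(?![\p{L}\p{N}_])/iu';

    return preg_match($pattern, $text) === 1;
}

var_dump(contains_whole_word('זוהי בדיקה', array('בדיקה', 'מילה')));
```

The lookarounds do not depend on how `\b` and `\w` are defined, so the result should not change between PCRE versions the way the original pattern does.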

Andrew