I'm trying to get this regex to work. It is intended to find two words that appear near each other in a sentence.

echo (int)preg_match('/\bHello\W+(?:\w+\W+){0,6}?World\b/ui', 'Hello, world!', $matches).PHP_EOL;
print_r($matches);

And it works perfectly:

1
Array
(
    [0] => Hello, world
)

... but only with Latin words. As soon as I switch to Unicode text, it doesn't find anything. The syntax itself should be fine, because the pattern comes from a book (chapter 8, "Find Two Words Near Each Other"). It simply fails for Unicode strings such as 'Привіт, світу!' (Ukrainian).

I have checked almost every possible cause:

✓ I'm using the 'u' flag in the regex pattern.

✓ I enable UTF-8 support in the code before executing this statement:

 ini_set('default_charset', 'UTF-8');
 mb_internal_encoding('UTF-8');
 mb_regex_encoding('UTF-8');

✓ My PCRE on Debian Linux is compiled correctly:

 # pcretest -C
 PCRE version 8.02 2010-03-19
 Compiled with
   UTF-8 support
   Unicode properties support
   Newline sequence is LF
   \R matches all Unicode newlines
   Internal link size = 2
   POSIX malloc threshold = 10
   Default match limit = 10000000
   Default recursion depth limit = 10000000
   Match recursion uses stack

✓ I even tried prefixing the pattern with the (*UTF8) sequence, as suggested in another answer, but it didn't help:

echo (int)preg_match('/(*UTF8)\bПривіт\W+(?:\w+\W+){0,6}?світу\b/ui', 'Привіт, світу!', $matches).PHP_EOL;
print_r($matches);

The result:

0
Array
(
)

So my question is: why does Unicode not work here when it works perfectly for other Unicode patterns in the same code? Those patterns are a bit simpler, though, like this one:

echo (int)preg_match('/Привіт/ui', 'Привіт, світу!', $matches).PHP_EOL;
print_r($matches);

This surprisingly works:

1
Array
(
    [0] => Привіт
)
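
One way to narrow this down is to test separately the features that differ between the two patterns: the simple pattern contains only literal characters, while the failing one also relies on \b, \w and \W. The sketch below is only a diagnostic under that assumption. The probe names and the $hits array are made up for illustration, and the file has to be saved as UTF-8.

```php
<?php
// Diagnostic sketch: save this file as UTF-8.
$subject = 'Привіт, світу!';

// Hypothetical probe names; each pattern isolates one regex feature.
$probes = array(
    'literal'  => '/Привіт/ui',     // literal characters only
    'word'     => '/^\w+/u',        // \w against a Cyrillic word
    'boundary' => '/\bПривіт\b/u',  // \b around a Cyrillic word
    'property' => '/^\p{L}+/u',     // explicit Unicode letter property
);

$hits = array();
foreach ($probes as $name => $pattern) {
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    $hits[$name] = preg_match($pattern, $subject);
    echo str_pad($name, 10) . var_export($hits[$name], true) . PHP_EOL;
}
```

If 'literal' and 'property' report 1 while 'word' or 'boundary' report 0, the shorthand classes are not being treated as Unicode-aware on that installation, even though UTF-8 decoding itself works.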

Funnily enough, the pattern works fine in an online regex tester. That's why I'm so frustrated: I tested it there and expected it to work in my code too, but it doesn't.

Oh wise Stack Overflow, please give me a hint.

  • This might be too obvious, but is your file saved as utf8? – colburton Jun 14 '14 at 09:52
  • Could you please indicate your PHP version and OS? – user3740011 Jun 14 '14 at 09:53
  • On my Windows system your regex runs as intended. On an Ubuntu machine it generates this warning: "PHP Warning: preg_match(): Compilation failed: (*VERB) not recognized at offset 5". Have you set the error level to show warnings? – Alex Monthy Jun 14 '14 at 12:18
  • I have the following code: ini_set( "display_errors", "on" ); error_reporting( E_ALL ); And the script doesn't generate any errors, either at runtime or in the logs :( – user3740011 Jun 14 '14 at 19:05
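
Both questions raised in these comments (which versions are involved, and whether an error is being swallowed) can be checked from inside the script. The sketch below is one way to do that, not a definitive procedure. PCRE_VERSION reports the PCRE library that PHP itself was built against, which may differ from what the standalone pcretest tool reports. The variable names $result and $error are illustrative.

```php
<?php
// Sketch: print the versions PHP actually runs with, then run the
// failing pattern with full error reporting. Save the file as UTF-8.
error_reporting(E_ALL);
ini_set('display_errors', '1');

echo 'PHP:  ' . PHP_VERSION . PHP_EOL;
echo 'PCRE: ' . PCRE_VERSION . PHP_EOL; // library linked into PHP

$result = preg_match(
    '/\bПривіт\W+(?:\w+\W+){0,6}?світу\b/ui',
    'Привіт, світу!',
    $matches
);

// false means the pattern failed to compile or run; 0 means "no match".
var_dump($result);

// PREG_NO_ERROR (0) means the engine reported no runtime failure,
// so a 0 result is a genuine non-match rather than a hidden error.
$error = preg_last_error();
echo 'preg_last_error: ' . $error . PHP_EOL;
```

A result of 0 together with PREG_NO_ERROR would be consistent with what is described above: the pattern compiles and runs without warnings, it just does not match.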

1 Answer

I had a similar problem once and found that UTF-8 characters inside patterns do not work on some versions of PHP. Even 5.3, which was current at the time, had this problem. You can check your example here: http://3v4l.org/7HurJ. According to that test, you need at least PHP 5.3.4 for the pattern to work, although I don't think the version number by itself means much. It may actually depend on a compile option, or there may be a workaround, but I didn't dig deeper and simply changed my approach to avoid any "funny" characters in expressions.
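
As for a possible workaround where upgrading is not an option, here is a sketch only, not something taken from the book or verified on the affected server. The idea is to stop relying on \b, \w and \W and spell them out with explicit Unicode property classes, which need the "Unicode properties support" shown in the pcretest output. The helper variables $word and $nonWord are hypothetical names, and the class [\p{L}\p{N}_] is an approximation of a Unicode-aware \w.

```php
<?php
// Workaround sketch: save this file as UTF-8.
// Approximate a Unicode-aware \w and \W with property classes.
$word    = '[\p{L}\p{N}_]';
$nonWord = '[^\p{L}\p{N}_]';

// Lookarounds stand in for \b at the start and end of the phrase.
$pattern = '/(?<!' . $word . ')Привіт' . $nonWord . '+'
         . '(?:' . $word . '+' . $nonWord . '+){0,6}?'
         . 'світу(?!' . $word . ')/ui';

$found = preg_match($pattern, 'Привіт, світу!', $matches);

echo (int)$found . PHP_EOL;
print_r($matches);
```

On a current PHP build this should match 'Привіт, світу'. Whether it also helps on an older installation depends on how that build handles \p{...} in UTF-8 mode, so it is worth testing there before relying on it.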

Gas Welder
  • Thank you very much, also for this great tool. It really is a PHP version issue. The VPS I'm using is a bit old (Debian 6, 512 MB RAM, ordered a few years ago but still useful). I tried PHP 5.6.0beta3 on a local Debian machine and it works flawlessly. Once again, sorry for the trouble and thanks for the help. The solution is to update the PHP version. – user3740011 Jun 14 '14 at 20:51