1

I want to check whether the word 'açilek' occurs in a string. Running this:

$word = 'açilek';
$article='elma  and  açilek word';
$mat=preg_match('/\b'. $word .'\b/', $article);
var_dump($mat);

This succeeds, as expected. However, when I try to match the word 'çilek', the code reports no match, which is not expected:

$word = 'çilek';
$article='elma  and  çilek word';
$mat=preg_match('/\b'. $word .'\b/', $article);
var_dump($mat); // int(0): no match, which is unexpected

Additionally, it matches the word when it is only part of a longer word, which is also unexpected:

$word = 'çilek';
$article='elma  and  açilek word';
$mat=preg_match('/\b'. $word .'\b/', $article);
var_dump($mat); // int(1): a match, which is unexpected

Why am I seeing this behavior?

Paul Beckingham
Atef
Possible duplicate of [preg_match and UTF-8 in PHP](http://stackoverflow.com/questions/1725227/preg-match-and-utf-8-in-php) – Noam Rathaus Dec 29 '13 at 12:24

2 Answers

3

Beware that the PCRE engine does not treat UTF-8 characters in patterns or metacharacters as such (which may well break the matching) unless you supply the "u" modifier, like so:

http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php

$mat=preg_match('/\b'. $word .'\b/u', $article);
Calimero
    The solution is correct, the explanation is wrong, though. It's the `\b` anchor that is not taking Unicode word boundaries into account - otherwise you wouldn't see a match with `$mat=preg_match('/\bçilek\b/', 'açilek');`. – Tim Pietzcker Dec 29 '13 at 12:28
3

You need to use the /u modifier to make the regex (especially \b) Unicode-aware:

$mat=preg_match('/\b'. $word .'\b/u', $article);

Without it, \b treats only ASCII letters, digits, and the underscore as word characters, so a word boundary exists only where one of those meets any other byte. That is why the pattern matches between a and çilek but not between a space and çilek.
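
As a rough illustration, the three cases from the question can be rerun with the /u modifier. This is only a sketch: the helper name containsWord is hypothetical and not part of the original answers, it assumes PHP 7 or later, and it assumes both the source file and the input strings are UTF-8 encoded. It also wraps the word in preg_quote() so that regex metacharacters inside the word are taken literally.

```php
<?php
// Hypothetical helper, not from the original answers.
// Assumes $word and $text are valid UTF-8 strings.
function containsWord(string $word, string $text): bool
{
    // preg_quote() escapes regex metacharacters in $word; the second
    // argument also escapes the "/" delimiter.
    $pattern = '/\b' . preg_quote($word, '/') . '\b/u';

    // preg_match() returns 1 on a match, 0 on no match, and false on
    // error (for example, invalid UTF-8 input under the /u modifier).
    return preg_match($pattern, $text) === 1;
}

var_dump(containsWord('açilek', 'elma  and  açilek word')); // bool(true)
var_dump(containsWord('çilek', 'elma  and  çilek word'));   // bool(true)
var_dump(containsWord('çilek', 'elma  and  açilek word'));  // bool(false)
```

With /u, ç counts as a word character, so there is no boundary between a and ç in 'açilek' and the third call no longer matches.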

Tim Pietzcker