1

I am trying to extract a word from a string, but the word may contain Cyrillic characters. Nothing I have tried so far works.

Please help. My code:

$str = "Продавец:В KrossАдын рассказать друзьям  var addthis_config = {'data_track_clickback':true};";
$pattern = '/\s(\w*|.*?)\s/';
preg_match($pattern, $str, $matches);
echo $matches[0];

I need to get KrossАдын.

Thanks!

bigjoy10
  • 1
    possible duplicate of [UTF-8 in PHP regular expressions](http://stackoverflow.com/questions/6407983/utf-8-in-php-regular-expressions) – John M. Sep 05 '14 at 15:29

2 Answers

3

You can change the meaning of \w with the u modifier. With the u modifier, the string is read as a UTF-8 string, and the \w character class is no longer [a-zA-Z0-9_] but [\p{L}\p{N}_]:

$pattern = '/\s(\w*|.*?)\s/u';
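As a minimal sketch of how this pattern could be used, assuming the input is valid UTF-8 and using a shortened copy of the string from the question (the `$word` variable is introduced here for illustration only). Note that the word itself is in capture group 1; index 0 holds the whole match, including the surrounding whitespace.

```php
<?php
// Shortened sample from the question (assumed to be valid UTF-8).
$str = "Продавец:В KrossАдын рассказать друзьям";

// The u modifier makes PCRE read the subject as UTF-8,
// so \w can also match Cyrillic letters.
$pattern = '/\s(\w*|.*?)\s/u';

$word = null;
if (preg_match($pattern, $str, $matches) === 1) {
    // $matches[0] is the full match with the whitespace on both sides;
    // $matches[1] is only the captured word.
    $word = $matches[1];
    echo $word, PHP_EOL;
}
```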

Note that the alternation in the pattern is pointless:

the second branch can match everything the first branch can. (Anything matched by \w* can also be matched by .*? because a whitespace character follows on the right. Both subpatterns match the characters between two whitespace characters.)

Writing $pattern = '/\s(.*?)\s/u'; does exactly the same, or better:

$pattern = '/\s(\S*)\s/u';

which avoids the lazy quantifier.
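To illustrate that the two simplified patterns capture the same text, here is a small comparison. It is a sketch on a shortened copy of the question's string; the variable names `$lazy`, `$greedy` and `$same` are made up for this example.

```php
<?php
// Shortened sample from the question (assumed to be valid UTF-8).
$str = "Продавец:В KrossАдын рассказать друзьям";

// Lazy version: expands one character at a time until a whitespace follows.
preg_match('/\s(.*?)\s/u', $str, $lazy);

// \S* version: consumes all non-whitespace characters in one go.
preg_match('/\s(\S*)\s/u', $str, $greedy);

// Both capture the text between the first two whitespace characters.
$same = ($lazy[1] === $greedy[1]);
var_dump($same);
```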

If your goal is only to match ASCII and Cyrillic letters, the most efficient pattern (for character classes, the smaller the class, the faster the match) will be:

$pattern = '~(*UTF8)[a-z\p{Cyrillic}]+~i';

(*UTF8) tells the regex engine that the subject string must be read as a UTF-8 string.

\p{Cyrillic} is a character class that contains only characters of the Cyrillic script.
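The following sketch shows what such a character class picks out of the sample text. As a substitution made for this example, it uses the u modifier in place of the leading (*UTF8) verb; the shortened input string and the `$words` variable are also assumptions for illustration.

```php
<?php
// Shortened sample from the question (assumed to be valid UTF-8).
$str = "Продавец:В KrossАдын рассказать друзьям";

// Match runs of ASCII or Cyrillic letters, case-insensitively.
// The u modifier is used here instead of the (*UTF8) verb.
preg_match_all('~[a-z\p{Cyrillic}]+~iu', $str, $matches);

// $matches[0] holds every run of letters found, in order.
$words = $matches[0];
print_r($words);
```

Because this pattern has no whitespace anchors, it returns every word in the string, so the wanted word still has to be picked from the list (here it is the third entry).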

Casimir et Hippolyte
1

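The issue is that your string contains multibyte UTF-8 characters, which \w will not match without the u modifier. See this Stack Overflow question for a solution: [UTF-8 in PHP regular expressions](http://stackoverflow.com/questions/6407983/utf-8-in-php-regular-expressions)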

Essentially, you'll want to add the u modifier at the end of your expression, and use \p{L} instead of \w.
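A minimal sketch of that suggestion, assuming a UTF-8 input and a shortened copy of the question's string (the exact pattern and the `$word` variable are illustrative, not taken from the linked answer):

```php
<?php
// Shortened sample from the question (assumed to be valid UTF-8).
$str = "Продавец:В KrossАдын рассказать друзьям";

// \p{L} matches any Unicode letter; the u modifier enables UTF-8 mode.
$word = null;
if (preg_match('/\s(\p{L}+)\s/u', $str, $m) === 1) {
    // Group 1 is the first run of letters enclosed by whitespace.
    $word = $m[1];
    echo $word, PHP_EOL;
}
```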

John M.