You can change the meaning of \w by using the u modifier. With the u modifier, the string is read as a UTF-8 string, and the \w character class is no longer [a-zA-Z0-9_] but [\p{L}\p{N}_]:
$pattern = '/\s(\w*|.*?)\s/u';
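To see the effect of the u modifier on \w, here is a minimal sketch. The sample string and the variable names are made up for illustration, and it assumes a PHP build whose PCRE library has Unicode support and a script running in the default C locale:

```php
<?php
// Hypothetical sample: a Cyrillic word surrounded by spaces.
$text = ' привет ';

// Without u, the subject is treated as bytes and \w only covers ASCII
// word characters, so the Cyrillic word should not be matched.
$withoutU = preg_match('/\s(\w+)\s/', $text, $m1);

// With u, the subject is read as UTF-8 and \w also covers Unicode
// letters and digits, so the whole word should be captured.
$withU = preg_match('/\s(\w+)\s/u', $text, $m2);

var_dump($withoutU);        // number of matches without u
var_dump($withU, $m2[1]);   // number of matches with u, and the captured word
```
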
Note that the alternation in the pattern makes no sense: its second branch can match everything the first branch can. Anything matched by \w* can also be matched by .*? because a whitespace character must follow on the right, so both branches end up matching the characters between two whitespace characters.
Writing $pattern = '/\s(.*?)\s/u';
does exactly the same thing, or better:
$pattern = '/\s(\S*)\s/u';
which avoids using a lazy quantifier.
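As a quick check of that equivalence, the sketch below runs the three patterns over the same subject and compares the captures. The sample text and the array keys are invented for the example, so treat it as an illustration rather than a proof:

```php
<?php
// Hypothetical subject mixing Cyrillic, ASCII, punctuation and a tab.
$text = "один two, три-4 пять\tshest ";

$patterns = [
    'alternation' => '/\s(\w*|.*?)\s/u',
    'lazy_dot'    => '/\s(.*?)\s/u',
    'non_space'   => '/\s(\S*)\s/u',
];

$results = [];
foreach ($patterns as $name => $pattern) {
    // Collect the content of capture group 1 for every match.
    preg_match_all($pattern, $text, $matches);
    $results[$name] = $matches[1];
}

// All three lists should be identical for this subject.
var_dump($results['alternation'] === $results['lazy_dot']);
var_dump($results['lazy_dot'] === $results['non_space']);
print_r($results['non_space']);
```

Note that each match consumes its closing whitespace, so two words separated by a single space cannot both be captured; that behaviour is the same for all three patterns.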
If your goal is only to match ASCII and Cyrillic letters, the most efficient pattern (for character classes, the smaller the class, the faster the match) will be:
$pattern = '~(*UTF8)[a-z\p{Cyrillic}]+~i';
(*UTF8) tells the regex engine that the subject string must be read as a UTF-8 string.
\p{Cyrillic} is a character class that only matches characters from the Cyrillic script.
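A small usage sketch for this kind of pattern follows. The sample text is invented, and the example uses the u modifier instead of the (*UTF8) verb. That is a substitution on my part, made on the assumption that the verb's spelling may not be accepted by every PCRE version, so check it against your own PHP build:

```php
<?php
// Same character class as above, with the u modifier (an assumption of
// this sketch) switching the engine to UTF-8 mode instead of (*UTF8).
$pattern = '~[a-z\p{Cyrillic}]+~iu';

// Hypothetical subject: ASCII and Cyrillic words, digits, punctuation.
$text = 'Hello, Мир! 123 тест_case';

preg_match_all($pattern, $text, $matches);

// Digits, punctuation and the underscore are not in the class,
// so they act as separators between the matched runs of letters.
print_r($matches[0]);
```
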