PHP regex case-insensitive on cyrillic charset

Question

I am using preg_replace and preg_match in PHP, and my pages are encoded in Windows-1251 (Cyrillic). I am trying to match a word using the case-insensitive modifier.

I ran these tests:

$pattern = '/myCyrillicWord1|myCyrillicWord2/i';
$subject = 'Am I able to find MYCyrILlicWord1 or not';
$res = preg_replace($pattern, 'matched', $subject);

On UTF-8:

With the UTF-8 modifier (u) in the pattern:

$pattern = '/myCyrillicWord1|myCyrillicWord2/iu';
$output = 'Am I able to find matched or not';

Without it:

$pattern = '/myCyrillicWord1|myCyrillicWord2/i';
$output = 'Am I able to find MYCyrILlicWord1 or not';

On Windows-1251:

$pattern = '/myCyrillicWord1|myCyrillicWord2/i';
$output = 'Am I able to find MYCyrILlicWord1 or not';

The regex works on UTF-8 (with the u modifier) but not on Windows-1251. Note that I also tested with Cyrillic characters such as 'х' and 'Х' (which look like the Latin letters 'x' and 'X').

Is this behavior normal?

How can I match my Cyrillic words in the Windows-1251 charset with the case-insensitive modifier?

Many thanks.

score 2 · Accepted Answer · answered May 20 '14 at 17:43

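I don't think PCRE knows how to fold case in legacy single-byte charsets like Windows-1251, so your options are basically:

convert everything to UTF-8, process it with the u modifier and then convert back, or
use manually crafted regexes for case-insensitivity, like /[Дд][Ыы][Кк]/ to match Дык, дыК etc. (with the pattern itself saved in Windows-1251)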
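
Both options can be sketched roughly as below. This is only an illustration, not tested against your pages: the function names cp1251_ireplace and build_cp1251_pattern are made up, the word list and replacement are assumed to be written in a UTF-8 source file, the subject is assumed to arrive as Windows-1251 bytes, and the mbstring extension is assumed to be loaded. The second function generates the [Дд][Ыы][Кк] style classes automatically, so they don't have to be typed by hand.

```php
<?php
// Option 1 (hypothetical helper): round-trip through UTF-8 so that /iu
// can do the case folding, then convert the result back to Windows-1251.
function cp1251_ireplace(array $utf8Words, string $utf8Replacement, string $cp1251Subject): string
{
    $utf8Subject = mb_convert_encoding($cp1251Subject, 'UTF-8', 'Windows-1251');

    // Escape each word so it is matched literally.
    $quoted = array_map(function ($w) { return preg_quote($w, '/'); }, $utf8Words);
    $pattern = '/' . implode('|', $quoted) . '/iu';

    $result = preg_replace($pattern, $utf8Replacement, $utf8Subject);
    if ($result === null) {
        throw new RuntimeException('preg_replace failed: ' . preg_last_error());
    }
    return mb_convert_encoding($result, 'Windows-1251', 'UTF-8');
}

// Option 2 (hypothetical helper): build a byte-level pattern such as
// /[Дд][Ыы][Кк]/ in Windows-1251, so the subject needs no conversion.
function build_cp1251_pattern(array $utf8Words): string
{
    $alternatives = [];
    foreach ($utf8Words as $word) {
        $piece = '';
        // Split the UTF-8 word into single characters.
        foreach (preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
            $upper = mb_strtoupper($ch, 'UTF-8');
            $lower = mb_strtolower($ch, 'UTF-8');
            if ($upper === $lower) {
                // No case variants (digit, punctuation...): match literally.
                $piece .= preg_quote(mb_convert_encoding($ch, 'Windows-1251', 'UTF-8'), '/');
            } else {
                // Letter: accept either case, encoded as Windows-1251 bytes.
                $piece .= '['
                    . mb_convert_encoding($upper, 'Windows-1251', 'UTF-8')
                    . mb_convert_encoding($lower, 'Windows-1251', 'UTF-8')
                    . ']';
            }
        }
        $alternatives[] = $piece;
    }
    // No i or u modifier: the classes already cover both cases byte by byte.
    return '/' . implode('|', $alternatives) . '/';
}

// Made-up sample subject, converted to Windows-1251 to simulate the page.
$subject = mb_convert_encoding('Найти ДЫК здесь', 'Windows-1251', 'UTF-8');

$viaUtf8    = cp1251_ireplace(['дык'], 'matched', $subject);
$viaClasses = preg_replace(build_cp1251_pattern(['дык']), 'matched', $subject);
```

The first variant costs two conversions per call but leaves the case folding to PCRE. The second one works directly on the Windows-1251 bytes and should still be manageable with hundreds of words, since the pattern is generated once from the list.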

gog

Indeed it doesn't. So you're right, these two variants look like the answer. – Wladyslaw Brodsky May 21 '14 at 07:45
OK, many thanks for the answer. I chose the second variant because converting the whole page to UTF-8 and back to its original encoding is too heavy. It's fine for the moment as I have about ten words to check. With hundreds of words to check I would be disappointed, because this solution wouldn't suit that case. – KNTH May 21 '14 at 09:38