preg_match and UTF-8 in PHP

Question

I'm trying to search a UTF8-encoded string using preg_match.

preg_match('/H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);
echo $a_matches[0][1];

This should print 1, since "H" is at index 1 in the string "¡Hola!". But it prints 2. So it seems like it's not treating the subject as a UTF8-encoded string, even though I'm passing the "u" modifier in the regular expression.

I have the following settings in my php.ini, and other UTF8 functions are working:

mbstring.func_overload = 7
mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.http_input = pass
mbstring.http_output = pass
mbstring.encoding_translation = Off

Any ideas?

see http://stackoverflow.com/questions/2187615/utf-8-characters-in-preg-match-all-php — Artefacto, Aug 15 '10 at 13:53

score 52 · Answer 1 · edited Nov 04 '14 at 23:02

Although the u modifier makes both the pattern and subject be interpreted as UTF-8, the captured offsets are still counted in bytes.

To convert a byte offset into a character offset, cut the subject at that offset with the byte-based substr and count the characters in the resulting prefix with mb_strlen:

$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE);
echo mb_strlen(substr($str, 0, $a_matches[0][1]));

edited Nov 04 '14 at 23:02

Mark Amery

answered Nov 12 '09 at 20:56

Gumbo

"The u modifier is only to get the pattern interpreted as UTF-8, not the subject." This is not true. Compare e.g. `preg_split('//', .)` with `preg_split('//u', .)`. Since this "x is interpret as UTF-8" is a bit vague, see [this](http://www.pcre.org/pcre.txt) for the actual effects of the unicode mode. – Artefacto Aug 30 '10 at 03:41
According to http://nl1.php.net/manual/en/reference.pcre.pattern.modifiers.php#103348 the *u* modifier has effect on both the pattern **and the input**. – Lode Oct 19 '12 at 05:36
@LukaRamishvili Some people are of the opinion that it sucks at many things. – Michael Robinson May 13 '14 at 23:29
OK, php sucks at Unicode, but maybe with one constraint now: version<7.0. UString is coming https://wiki.php.net/rfc/ustring. UString aims to tackle the issues of working with Unicode strings – Igor Dec 05 '15 at 15:21
Fun fact, this produces different output depending upon your version. https://3v4l.org/iHl4a – bishop Jan 14 '16 at 22:11
@tomalak and next ones. Of course, php doesn't manage unicode, because it works on bytes if you use old functions like substr, strlen, etc., but it is fully managed since a very long time via the extension mbstring, enabled by default in many distributions and servers. This is a choice to maintain backward compatibility. – Daniel-KM Jun 22 '17 at 10:19
I have had **NO TROUBLE** with UTF-8 in PHP since I started converting all my old sites to Unicode 4-5 years ago. – TheStoryCoder Oct 31 '17 at 21:04
@Tomalak "Man, it's 2019 and PHP still sucks abysmally at Unicode." Please confirm. – Pathros Dec 31 '18 at 19:10
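
The preg_split comparison mentioned in the comments above can be sketched as follows. This is only an illustration with a made-up two-character subject; it suggests that the u modifier does change how the subject is read, even though the reported offsets stay in bytes.

```php
<?php
$str = "\xC2\xA1H"; // "¡H": 3 bytes, 2 characters

// Without /u the empty pattern matches between every byte.
$bytes = preg_split('//', $str, -1, PREG_SPLIT_NO_EMPTY);

// With /u it matches only between complete UTF-8 characters.
$chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

echo count($bytes), "\n"; // 3
echo count($chars), "\n"; // 2
```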

score 29 · Answer 2 · edited Jan 17 '22 at 12:17

Try adding (*UTF8) at the start of the pattern, immediately after the opening delimiter:

preg_match('/(*UTF8)H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);

Magic, thanks to a comment in https://www.php.net/manual/function.preg-match.php#95828

edited Jan 17 '22 at 12:17

Syscall

answered Feb 27 '12 at 23:12

Natxet

Interesting, although I think you need the initial `/` before the `(*UTF8)`. This doesn't work on my system, but it might on others. What does this output when you do `echo $a_matches[0][1];`? – JW. Feb 28 '12 at 00:05
I used it like this on PHP 5.4.29, works like a charm: `preg_match_all('/(*UTF8)[^A-Za-z0-9\s]/', $txt, $matches);` – Novalis Jul 05 '14 at 08:57
Doesn't work for me on either PHP 5.6 or PHP 7 on Ubuntu 16.04. `(*UTF8)` before delimiter is an error, after has no effect. I suspect that it depends on how/where you got your php, specifically the settings that `libpcre*` was compiled with. – Nov 24 '16 at 16:37
Does not change the offsets for me, but that's an interesting thing to know. The original documentation for that "feature" is: http://www.pcre.org/pcre.txt – BurninLeo Oct 15 '17 at 15:29

score 24 · Accepted Answer · edited Jan 24 '16 at 19:07

Looks like this is a "feature", see http://bugs.php.net/bug.php?id=37391

'u' switch only makes sense for pcre, PHP itself is unaware of it.

From PHP's point of view, strings are byte sequences and returning byte offset seems logical (i don't say "correct").
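
A minimal sketch of the "strings are byte sequences" point, using the string from the question and assuming the mbstring extension is available. The byte-based functions agree with what preg_match reports, while the mb_* functions count characters.

```php
<?php
$str = "\xC2\xA1Hola!"; // "¡Hola!"

echo strlen($str), "\n";                     // 7: bytes ("¡" takes two)
echo mb_strlen($str, 'UTF-8'), "\n";         // 6: characters
echo strpos($str, 'H'), "\n";                // 2: byte offset, as preg_match reports
echo mb_strpos($str, 'H', 0, 'UTF-8'), "\n"; // 1: character offset
```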

edited Jan 24 '16 at 19:07

Tomalak

answered Nov 12 '09 at 21:10

user187291

Great...and they don't provide a mb_preg_replace. – JW. Nov 12 '09 at 22:21
Be aware that the same "rules" regarding UTF-8 handling apply to the 5th parameter `$offset`: it is counted in bytes too. Sample: `var_dump(preg_match('/#/u', "\xc3\xa4#",$matches,0,2));` – AthanasiusKirchner Feb 24 '16 at 09:36
php is aware of the u modifier it's listed in the manual see "u (PCRE_UTF8)" http://php.net/manual/en/reference.pcre.pattern.modifiers.php – Walt Sorensen May 07 '17 at 14:40
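
As the comment about `$offset` notes, the start position passed to preg_match is also a byte count. A rough sketch of converting a character position into that byte offset, assuming mbstring is available (the variable names are illustrative only):

```php
<?php
$subject = "\xC3\xA4#"; // "ä#": "ä" is two bytes, so "#" is at byte 2, character 1

$charOffset = 1; // where we want to start, counted in characters

// Convert to bytes: byte length of the first $charOffset characters.
$byteOffset = strlen(mb_substr($subject, 0, $charOffset, 'UTF-8'));

$found = preg_match('/#/u', $subject, $m, PREG_OFFSET_CAPTURE, $byteOffset);

echo $byteOffset, "\n"; // 2
echo $found, "\n";      // 1
echo $m[0][1], "\n";    // 2: still a byte offset from the start of $subject
```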

score 9 · Answer 4 · edited Feb 29 '16 at 00:33

Excuse me for necroposting, but maybe somebody will find this useful: the code below can replace both preg_match and preg_match_all, and it returns the matches with correct character offsets for UTF-8-encoded strings.

    mb_internal_encoding('UTF-8');

    /**
     * Returns array of matches in same format as preg_match or preg_match_all
     * @param bool   $matchAll If true, execute preg_match_all, otherwise preg_match
     * @param string $pattern  The pattern to search for, as a string.
     * @param string $subject  The input string.
     * @param int    $offset   The place from which to start the search (in bytes).
     * @return array
     */
    function pregMatchCapture($matchAll, $pattern, $subject, $offset = 0)
    {
        $matchInfo = array();
        $method    = 'preg_match';
        $flag      = PREG_OFFSET_CAPTURE;
        if ($matchAll) {
            $method .= '_all';
        }
        $n = $method($pattern, $subject, $matchInfo, $flag, $offset);
        $result = array();
        if ($n !== 0 && !empty($matchInfo)) {
            if (!$matchAll) {
                $matchInfo = array($matchInfo);
            }
            foreach ($matchInfo as $matches) {
                $positions = array();
                foreach ($matches as $match) {
                    $matchedText = $match[0];
                    $byteOffset  = $match[1]; // preg_* reports offsets in bytes
                    $positions[] = array(
                        $matchedText,
                        // mb_strcut() cuts by bytes, mb_strlen() counts characters
                        mb_strlen(mb_strcut($subject, 0, $byteOffset))
                    );
                }
                $result[] = $positions;
            }
            if (!$matchAll) {
                $result = $result[0];
            }
        }
        return $result;
    }

    $s1 = 'Попробуем русскую строку для теста';
    $s2 = 'Try english string for test';

    var_dump(pregMatchCapture(true, '/обу/', $s1));
    var_dump(pregMatchCapture(false, '/обу/', $s1));

    var_dump(pregMatchCapture(true, '/lish/', $s2));
    var_dump(pregMatchCapture(false, '/lish/', $s2));

Output of my example:

    array(1) {
      [0]=>
      array(1) {
        [0]=>
        array(2) {
          [0]=>
          string(6) "обу"
          [1]=>
          int(4)
        }
      }
    }
    array(1) {
      [0]=>
      array(2) {
        [0]=>
        string(6) "обу"
        [1]=>
        int(4)
      }
    }
    array(1) {
      [0]=>
      array(1) {
        [0]=>
        array(2) {
          [0]=>
          string(4) "lish"
          [1]=>
          int(7)
        }
      }
    }
    array(1) {
      [0]=>
      array(2) {
        [0]=>
        string(4) "lish"
        [1]=>
        int(7)
      }
    }

Can you explain what your code does instead of just pasting a code dump? And how does this answer the question? — nhahtdh, Dec 19 '14 at 05:41
It does exactly what is described in the comments and returns CORRECT string offsets. That is the subject of the question. No idea why I got -2 for my answer. It is working for me. — Guy Fawkes, Dec 19 '14 at 05:50
Well, that's why you should include an explanation of what your code does. People don't get what you are trying to do here. — nhahtdh, Dec 19 '14 at 07:17
To use `$offset` as `(characters)` instead of `(bytes)`, you can add this near the top of the function: `if ($offset) { $offset = strlen(mb_substr($subject, 0, $offset)); }` — Goozak, Jan 22 '20 at 22:06
Thanks. After searching on multiple questions, this is the only answer that does the trick. — Gabe Hiemstra, Apr 07 '20 at 18:45
An old comment of "necroposting", but still useful! Thank you @GuyFawkes, this helped with my current mess of code I'm working through. Cheers, jz — J.Z., Dec 08 '22 at 18:42

score 3 · Answer 5 · answered Jun 17 '22 at 22:03

You can calculate the character offset by cutting the string at the byte offset returned by preg_match with the byte-counting substr, and then measuring that prefix with the character-counting mb_strlen.

$utf8Offset = mb_strlen(substr($text, 0, $offsetFromPregMatch), 'UTF-8');
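
The same conversion can be applied to every match from preg_match_all. The helper below is a hypothetical sketch (the name matchAllWithCharOffsets is not from any library), assuming a valid UTF-8 subject and the mbstring extension:

```php
<?php
// Hypothetical helper: returns [text, characterOffset] pairs for every
// match of $pattern in a UTF-8 encoded $subject.
function matchAllWithCharOffsets(string $pattern, string $subject): array
{
    $result = [];
    if (preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE)) {
        foreach ($matches[0] as [$text, $byteOffset]) {
            // substr() cuts by bytes, mb_strlen() counts characters.
            $result[] = [$text, mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8')];
        }
    }
    return $result;
}

// "¡Hola! ¡Hola!": the two "H" sit at bytes 2 and 10, characters 1 and 8.
$found = matchAllWithCharOffsets('/H/u', "\xC2\xA1Hola! \xC2\xA1Hola!");
print_r($found);
```

Each call recounts the prefix from the start of the subject, so for a very large number of matches a precomputed map may be cheaper.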

score 1 · Answer 6 · answered Jun 22 '17 at 14:21

I wrote a small class to convert the byte offsets returned by preg_match into proper UTF-8 character offsets:

final class NonUtfToUtfOffset
{
    /** @var int[] */
    private $utfMap = [];

    public function __construct(string $content)
    {
        $contentLength = mb_strlen($content);

        for ($offset = 0; $offset < $contentLength; $offset ++) {
            $char = mb_substr($content, $offset, 1);
            $nonUtfLength = strlen($char);

            for ($charOffset = 0; $charOffset < $nonUtfLength; $charOffset ++) {
                $this->utfMap[] = $offset;
            }
        }
    }

    public function convertOffset(int $nonUtfOffset): int
    {
        return $this->utfMap[$nonUtfOffset];
    }
}

You can use it like this:

$content = 'aą bać d';
$offsetConverter = new NonUtfToUtfOffset($content);

preg_match_all('#(bać)#ui', $content, $m, PREG_OFFSET_CAPTURE);

foreach ($m[1] as [$word, $offset]) {
    echo "bad: " . mb_substr($content, $offset, mb_strlen($word))."\n";
    echo "good: " . mb_substr($content, $offsetConverter->convertOffset($offset), mb_strlen($word))."\n";
}

https://3v4l.org/8Y32J

score 1 · Answer 7 · edited Jan 18 '19 at 15:34

You might want to look at the T-Regx library.

pattern('/Hola/u')->match("\xC2\xA1Hola!")->first(function (Match $match) 
{
    echo $match->offset();     // characters
    echo $match->byteOffset(); // bytes
});

Here $match->offset() is the UTF-8-safe offset, counted in characters.

edited Jan 18 '19 at 15:34

answered Sep 24 '18 at 07:55

Danon

score 1 · Answer 8 · answered Aug 16 '11 at 21:19

If all you want to do is find the multi-byte-safe position of H, try mb_strpos():

mb_internal_encoding('UTF-8');
$str = "\xC2\xA1Hola!";
$pos = mb_strpos($str, 'H');
echo $str."\n";
echo $pos."\n";
echo mb_substr($str,$pos,1)."\n";

Output:

¡Hola!
1
H

answered Aug 16 '11 at 21:19

velcrow

That was just a simplified example, but this may be useful for others. – JW. Aug 16 '11 at 22:45

score 0 · Answer 9 · answered Jul 06 '23 at 15:32

For me (PHP 7.4), the problem was solved simply by using plain substr instead of mb_substr.

Combining mb_substr with the offsets from preg_match_all / PREG_OFFSET_CAPTURE (with or without the /u modifier) gave an incorrect position when the text contained the euro sign (€). This is consistent with those offsets being counted in bytes, which is what substr expects.

iconv and utf8_encode did not help either, and I was not able to use htmlentities.

Reverting to plain substr fixed it, and it worked correctly with € and other characters.
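
A small sketch of why this pairing works, using an invented sample string with a euro sign and assuming mbstring is available: the byte offset from preg_match lines up with substr, but lands on the wrong character when passed to mb_substr.

```php
<?php
$text = "5 \xE2\x82\xAC off"; // "5 € off"; "€" is three bytes in UTF-8

preg_match('/off/u', $text, $m, PREG_OFFSET_CAPTURE);
[$word, $byteOffset] = $m[0];

// Byte offset with byte-based substr(): correct.
$viaSubstr = substr($text, $byteOffset, strlen($word));

// Byte offset misused as a character offset: starts too far to the right.
$viaMbSubstr = mb_substr($text, $byteOffset, mb_strlen($word, 'UTF-8'), 'UTF-8');

echo $byteOffset, "\n";  // 6
echo $viaSubstr, "\n";   // off
echo $viaMbSubstr, "\n"; // f
```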
