UTF-8 characters in preg_match_all (PHP)

Question

I have preg_match_all('/[aäeëioöuáéíóú]/u', $in, $out, PREG_OFFSET_CAPTURE);

If $in = 'hëllo' $out is:

array(1) {
  [0]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      string(2) "ë"
      [1]=>
      int(1)
    }
    [1]=>
    array(2) {
      [0]=>
      string(1) "o"
      [1]=>
      int(5)
    }
  }
}

The position of o should be 4. I've read about this problem online (the ë gets counted as 2). Is there a solution for this? I've seen mb_substr and similar, but is there something like this for preg_match_all?

Kind of related: Is there an equivalent of preg_match_all in Python? (Returning an array of matches with their position in the string)

you should ask that in a different question, but yes... a python regex matchobject contains the match position by default mo.start() and mo.end() — Tor Valamo, Feb 02 '10 at 21:27

Artefacto · Answer 1 · 2010-08-08T19:10:30.720

This is not a bug: PREG_OFFSET_CAPTURE reports the byte offset of the match in the string, not its character offset.
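To make the byte-versus-character difference concrete, here is a small sketch. It assumes the source file is saved as UTF-8, that "ë" is the precomposed character U+00EB, and that the mbstring extension is loaded:

```php
<?php
$in = 'hëllo';

echo strlen($in), "\n";                      // 6 (bytes)
echo mb_strlen($in, 'UTF-8'), "\n";          // 5 (characters)
echo bin2hex('ë'), "\n";                     // c3ab ("ë" takes two bytes in UTF-8)

// strpos() counts bytes as well, so it agrees with PREG_OFFSET_CAPTURE.
echo strpos($in, 'o'), "\n";                 // 5
echo mb_strpos($in, 'o', 0, 'UTF-8'), "\n";  // 4
```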

mb_ereg_search_pos behaves the same way. One possibility is to convert the string to UTF-32 first and then divide each position by 4 (because every Unicode code point is represented as a 4-byte sequence in UTF-32):

mb_regex_encoding("UTF-32");
$string = mb_convert_encoding('hëllo', "UTF-32", "UTF-8");
$regex =  mb_convert_encoding('[aäeëioöuáéíóú]', "UTF-32", "UTF-8");
mb_ereg_search_init ($string, $regex);
$positions = array();
while ($r = mb_ereg_search_pos()) {
    $positions[] = reset($r)/4;
}
print_r($positions);

gives:

Array
(
    [0] => 1
    [1] => 4
)

You could also convert the byte offsets into code point offsets. For UTF-8, a suboptimal implementation is:

function utf8_byte_offset_to_unit($string, $boff) {
    $result = 0;
    for ($i = 0; $i < $boff; ) {
        $result++;
        // The number of leading 1 bits in the lead byte gives the sequence length.
        $base2 = str_pad(decbin(ord($string[$i])), 8, "0", STR_PAD_LEFT);
        $p = strpos($base2, "0");
        if ($p === 0) { $i++; }                    // 0xxxxxxx: single byte (ASCII)
        elseif ($p >= 2 && $p <= 4) { $i += $p; }  // 2- to 4-byte sequence
        else { return FALSE; }                     // continuation byte or invalid lead byte
    }
    return $result;
}
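A shorter variant of the same idea, offered as a sketch rather than a drop-in replacement: count every byte in the prefix that is not a UTF-8 continuation byte (bit pattern 10xxxxxx). The name utf8_byte_offset_to_char is hypothetical, and the function assumes that the input is valid UTF-8 and that the offset lies on a character boundary within the string; unlike the function above, it does not detect malformed input.

```php
<?php
// Hypothetical helper: converts a byte offset into a code point offset.
// Assumes valid UTF-8 and 0 <= $byteOffset <= strlen($string).
function utf8_byte_offset_to_char(string $string, int $byteOffset): int
{
    $count = 0;
    for ($i = 0; $i < $byteOffset; $i++) {
        // Every character contributes exactly one non-continuation byte.
        if ((ord($string[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

echo utf8_byte_offset_to_char('hëllo', 5), "\n"; // 4
```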

score 5 · Answer 2 · answered Feb 27 '14 at 09:39

There is a simple workaround, to be applied after preg_match() has returned its matches. Iterate over every match and replace its position value with the following:

$utfPosition = mb_strlen(substr($wholeSubjectString, 0, $capturedEntryPosition), 'utf-8');

Tested on PHP 5.4 under Windows; it depends only on the Multibyte String (mbstring) PHP extension.
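Applied to the preg_match_all() call from the question, the workaround might look like the sketch below. The wrapper name preg_match_all_char_offsets is hypothetical, and the array destructuring in foreach assumes PHP 7.1 or later:

```php
<?php
// Hypothetical wrapper: runs preg_match_all() with PREG_OFFSET_CAPTURE and
// returns each full match paired with its character (not byte) offset.
function preg_match_all_char_offsets(string $pattern, string $subject): array
{
    preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE);
    $result = [];
    foreach ($matches[0] as [$text, $byteOffset]) {
        // substr() is byte-based, so this yields the prefix before the match;
        // mb_strlen() then counts the UTF-8 characters in that prefix.
        $charOffset = mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');
        $result[] = [$text, $charOffset];
    }
    return $result;
}

$out = preg_match_all_char_offsets('/[aäeëioöuáéíóú]/u', 'hëllo');
print_r($out); // "ë" at position 1, "o" at position 4
```

Each match re-scans the prefix of the subject, so the cost grows quadratically with many matches in a long string; for short inputs this is unlikely to matter.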

score 0 · Accepted Answer · edited Jan 14 '16 at 21:50


PHP doesn't support Unicode very well, so a lot of string functions, including preg_*, still count bytes instead of characters.

I tried finding a solution by encoding and decoding strings, but ultimately it all came down to the preg_match_all function.

About the Python question: a Python regex match object exposes the match position by default via mo.start() and mo.end(). See: http://docs.python.org/library/re.html#finding-all-adverbs-and-their-positions

edited Jan 14 '16 at 21:50

bishop

answered Feb 02 '10 at 21:14

Tor Valamo

Apparently it was planned to be fixed in PHP6, but as of yet, 2016 (6 years later) it is still only under discussion. Gotta love PHP developers. They have no actual clue. – Tor Valamo Jan 20 '16 at 02:17

score 0 · Answer 4 · answered Nov 19 '14 at 00:06

Another way to split a UTF-8 $string by a regular expression is to use the preg_split() function. Here is my working solution:

    $result = preg_split('~\[img/\d{1,}/img\]\s?~', $string, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

PHP 5.3.17
