10

I have preg_match_all('/[aäeëioöuáéíóú]/u', $in, $out, PREG_OFFSET_CAPTURE);

With $in = 'hëllo', $out is:

array(1) {
  [0]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      string(2) "ë"
      [1]=>
      int(1)
    }
    [1]=>
    array(2) {
      [0]=>
      string(1) "o"
      [1]=>
      int(5)
    }
  }
}

The position of o should be 4. I've read about this problem online (the ë is counted as 2 bytes). Is there a solution for this? I've seen mb_substr and similar functions, but is there something like this for preg_match_all?

Kind of related: Is there an equivalent of preg_match_all in Python? (Returning an array of matches with their positions in the string.)

roflwaffle
    You should ask that in a separate question, but yes: a Python regex match object exposes the match position by default via mo.start() and mo.end(). – Tor Valamo Feb 02 '10 at 21:27

4 Answers

7

This is not a bug: PREG_OFFSET_CAPTURE reports the byte offset of the match in the string, not the character offset.
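
To make the byte counting visible, here is a minimal sketch using the string from the question. It assumes the source file is saved as UTF-8 and that the mbstring extension is loaded; the variable names are only illustrative.

```php
<?php
// Assumes this file is saved as UTF-8 and ext/mbstring is available.
$in = 'hëllo';

preg_match_all('/[aäeëioöuáéíóú]/u', $in, $out, PREG_OFFSET_CAPTURE);

$byteLength = strlen($in);              // counts bytes: "ë" occupies 2 of them
$charLength = mb_strlen($in, 'UTF-8');  // counts code points
$offsetOfO  = $out[0][1][1];            // offset reported for the second match, "o"

echo "bytes: $byteLength, characters: $charLength, offset of o: $offsetOfO\n";
// prints: bytes: 6, characters: 5, offset of o: 5
```

The reported offset of 5 matches the byte layout (h = 1 byte, ë = 2 bytes, l, l), not the character index 4.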

mb_ereg_search_pos behaves the same way. One possibility is to convert the string to UTF-32 beforehand and then divide each position by 4 (because every Unicode code point is encoded as exactly 4 bytes in UTF-32):

mb_regex_encoding("UTF-32");
$string = mb_convert_encoding('hëllo', "UTF-32", "UTF-8");
$regex =  mb_convert_encoding('[aäeëioöuáéíóú]', "UTF-32", "UTF-8");
mb_ereg_search_init ($string, $regex);
$positions = array();
while ($r = mb_ereg_search_pos()) {
    $positions[] = reset($r)/4;
}
print_r($positions);

gives:

Array
(
    [0] => 1
    [1] => 4
)

You could also convert the byte offsets into code point offsets. For UTF-8, a suboptimal implementation is:

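function utf8_byte_offset_to_unit($string, $boff) {
    $result = 0;
    for ($i = 0; $i < $boff; ) {
        $result++;
        // Binary form of the lead byte, zero-padded to 8 digits.
        $byte = $string[$i];
        $base2 = str_pad(
            base_convert((string) ord($byte), 10, 2), 8, "0", STR_PAD_LEFT);
        // The number of leading 1 bits gives the length of the UTF-8 sequence.
        $p = strpos($base2, "0");
        if ($p === 0) { $i++; }                    // 0xxxxxxx: single-byte character
        elseif ($p >= 2 && $p <= 4) { $i += $p; }  // 2- to 4-byte sequence
        else { return FALSE; }                     // continuation byte or invalid lead byte
    }
    return $result;
}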
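
A more compact variant of the same idea, offered only as a sketch: in valid UTF-8 every character contributes exactly one byte that is not a continuation byte (10xxxxxx), so counting those bytes before the offset gives the code point offset. The function name utf8_byte_offset_to_codepoint is hypothetical, and the sketch assumes the input is valid UTF-8 and that the offset falls on a character boundary.

```php
<?php
/**
 * Hypothetical helper (not part of the answer above): converts a byte
 * offset into a code point offset by counting the bytes before it that
 * are not UTF-8 continuation bytes. Assumes $string is valid UTF-8.
 */
function utf8_byte_offset_to_codepoint($string, $byteOffset)
{
    // Never read past the end of the string.
    $limit = min($byteOffset, strlen($string));
    $count = 0;
    for ($i = 0; $i < $limit; $i++) {
        // Continuation bytes look like 10xxxxxx; skip them.
        if ((ord($string[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

echo utf8_byte_offset_to_codepoint('hëllo', 5), "\n"; // prints 4
```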
Artefacto
5

There is a simple workaround to apply after preg_match() has returned its results: iterate over every match and replace its position value with the following:

$utfPosition = mb_strlen(substr($wholeSubjectString, 0, $capturedEntryPosition), 'utf-8');

Tested on PHP 5.4 under Windows; it depends only on the Multibyte String (mbstring) extension.
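
Applied to the example from the question, the one-liner above could be used roughly like this. This is a sketch that assumes a UTF-8 source file and the mbstring extension; $positions is an illustrative name.

```php
<?php
// Assumes this file is saved as UTF-8 and ext/mbstring is available.
$in = 'hëllo';
preg_match_all('/[aäeëioöuáéíóú]/u', $in, $out, PREG_OFFSET_CAPTURE);

$positions = array();
foreach ($out[0] as $match) {
    list($text, $byteOffset) = $match;
    // Characters before the match = length of the byte prefix, measured in UTF-8.
    $positions[] = mb_strlen(substr($in, 0, $byteOffset), 'UTF-8');
}

print_r($positions); // character positions 1 (ë) and 4 (o)
```

Each conversion rescans the prefix of the subject, so for very long strings with many matches the total cost grows quadratically; for short inputs this is unlikely to matter.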

0

PHP doesn't support Unicode very well, so many string functions, including the preg_* family, still count bytes instead of characters.

I tried finding a solution by encoding and decoding strings, but ultimately it all came down to the preg_match_all function.

Regarding Python: a regex match object exposes the match position by default via mo.start() and mo.end(). See: http://docs.python.org/library/re.html#finding-all-adverbs-and-their-positions

bishop
Tor Valamo
  • Apparently a fix was planned for PHP 6, but as of 2016 (6 years later) it is still only under discussion. Gotta love PHP developers. They have no actual clue. – Tor Valamo Jan 20 '16 at 02:17
0

Another way to split a UTF-8 $string by a regular expression is to use preg_split(). Here is my working solution:

    $result = preg_split('~\[img/\d{1,}/img\]\s?~', $string, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

PHP 5.3.17
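
For illustration, here is how that pattern behaves on a made-up UTF-8 input. The sample string is hypothetical, and the sketch assumes a UTF-8 source file. Because the pattern contains no capture groups, PREG_SPLIT_DELIM_CAPTURE has nothing to return, so only PREG_SPLIT_NO_EMPTY is passed here.

```php
<?php
// Hypothetical sample input; assumes this file is saved as UTF-8.
$string = 'první [img/12/img] druhý [img/345/img] třetí';

// The delimiter is pure ASCII, so splitting on it never cuts a multibyte character.
$parts = preg_split('~\[img/\d{1,}/img\]\s?~', $string, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts); // three pieces: "první ", "druhý ", "třetí"
```

This sidesteps offsets entirely, which helps when only the pieces between matches are needed rather than the match positions themselves.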

revoke