How to get correct list position in multi-byte string using preg_match

Question

I am currently matching HTML using this code:

preg_match('/<\/?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;/u', $html, $match, PREG_OFFSET_CAPTURE, $position)

It matches everything perfectly; however, if the string contains a multibyte character, that character is counted as 2 characters in the returned position.

For example the returned $match array would give something like:

array
  0 => 
    array
      0 => string '<br />' (length=6)
      1 => int 132
  1 => 
    array
      0 => string 'br' (length=2)
      1 => int 133

The real number for the <br /> match is 128, but there are 4 multibyte characters, so it's giving 132. I really thought adding the /u modifier would make it realize what's going on, but no luck there.
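
A minimal reproduction of the discrepancy, as a sketch: the sample string below is made up for illustration (four 2-byte UTF-8 characters before the tag), and the pattern is shortened to the tag alternative. It suggests that the offset reported with PREG_OFFSET_CAPTURE is counted in bytes, with or without /u.

```php
<?php
// Illustrative input, not from the original post: "é" written as its
// two UTF-8 bytes so the example does not depend on the file encoding.
$html = str_repeat("\xC3\xA9", 4) . '<br />';

preg_match('/<\/?([a-z]+)[^>]*>/u', $html, $match, PREG_OFFSET_CAPTURE);

// Offset as reported by preg_match(): counted in bytes.
$byte_offset = $match[0][1];

// Offset counted in characters, for comparison.
$char_offset = mb_strpos($html, '<br />', 0, 'UTF-8');

echo $byte_offset, "\n"; // 8
echo $char_offset, "\n"; // 4
```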

If you're curious what I'm using this for: http://stackoverflow.com/questions/1193500/php-truncate-html-ignoring-tags — Dave Stein, Mar 30 '12 at 21:51
Does this help? http://stackoverflow.com/questions/1725227/preg-match-and-utf-8-in-php — haltabush, Mar 30 '12 at 22:03
Yeah it's helping. Surprised these threads didn't come up in the suggestions when I was asking — Dave Stein, Mar 30 '12 at 22:14
I actually can correct the position manually by counting the number of mb characters before the point I'm at in my function. I just can't figure a good regex for all "standard" characters on an English keyboard. — Dave Stein, Mar 30 '12 at 22:23
Maybe you can use http://stackoverflow.com/a/3432593/107152 "change the encoding to UTF-32 before and then divide the position by 4". — Qtax, Mar 30 '12 at 23:44
@Qtax I was able to get it to work based on that link, without dividing somehow. Posting the answer — Dave Stein, Apr 02 '12 at 14:01

score 3 · Answer 1 · answered Nov 09 '12 at 23:17

If you need a quick fix and don't care about speed:

$mb_pos = mb_strlen( substr($string, 0, $pos) );
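
As a rough sketch of how that one-liner could be used with preg_match: the helper name `byte_to_char_offset` and the sample string are hypothetical, and the subject is assumed to be valid UTF-8. The encoding is passed explicitly so the result does not depend on the mbstring internal encoding.

```php
<?php
/**
 * Hypothetical helper: convert a byte offset (as returned by
 * preg_match with PREG_OFFSET_CAPTURE) into a character offset.
 * Assumes $subject is valid UTF-8.
 */
function byte_to_char_offset($subject, $byte_offset)
{
    // Cut at the byte offset, then count the characters in that prefix.
    return mb_strlen(substr($subject, 0, $byte_offset), 'UTF-8');
}

// Illustrative input: four 2-byte characters, some text, then a tag.
$html = str_repeat("\xC3\xA9", 4) . 'text<br />';

preg_match('/<\/?([a-z]+)[^>]*>/u', $html, $match, PREG_OFFSET_CAPTURE);

$byte_pos = $match[0][1];                          // 12 (bytes)
$char_pos = byte_to_char_offset($html, $byte_pos); // 8 (characters)

// The character offset can be used with the mb_* functions.
$tag = mb_substr($html, $char_pos, mb_strlen($match[0][0], 'UTF-8'), 'UTF-8');
```

Each call rescans the prefix of the string, which is why this is slow when applied to many matches in a long document. The $offset argument of preg_match is also a byte offset, so a character position would need converting back before being passed in.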

psycho brm


score 3 · Accepted Answer · edited May 23 '17 at 12:24

I looked at this suggestion from @Qtax:

UTF-8 characters in preg_match_all (PHP)

And for some more reference, this bug surfaced while using this: Truncate text containing HTML, ignoring tags

The gist of the change is this:

$orig_utf = 'UTF-8';
$new_utf  = 'UTF-32';

mb_regex_encoding( $new_utf );

$html     = mb_convert_encoding( $html, $new_utf, $orig_utf );
$end_char = mb_convert_encoding( $end_char, $new_utf, $orig_utf );


mb_ereg_search_init( $html );

$pattern = '</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;';
$pattern = mb_convert_encoding( $pattern, $new_utf, $orig_utf );

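// The subject was already set by mb_ereg_search_init(); the second parameter
// of mb_ereg_search_pos() is an options string, not the subject.
while ( $printed < $limit && $tag_match = mb_ereg_search_pos( $pattern ) ) {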

  $tag_position = $tag_match[0]/4;
  $tag_length   = $tag_match[1];
  $tag          = mb_substr( $html, $tag_position, $tag_length/4, $new_utf );
  $tag_name     = preg_replace( '/[\s<>\/]+/', '', $tag );

  // Print text leading up to the tag.
  $str = mb_substr($html, $position, $tag_position - $position, $new_utf );

  .......

}

Also, in reference to the truncate HTML page, there are other necessary changes:

$first_char = mb_substr( $tag, 0, 1, $new_utf );

if ( $first_char == mb_convert_encoding( '&', $new_utf ) ) {
  ...
}

My source file is saved as UTF-8, so comparing the UTF-32 character directly against a literal ampersand from the file would not work; the literal has to be converted to UTF-32 first.
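
The fixed-width idea behind this can be sketched on its own. This is not the code above: it swaps mb_ereg_search_pos() for a plain strpos() on a literal needle, the function name `char_pos_via_utf32` is hypothetical, and 'UTF-32BE' is chosen explicitly as an assumption so that no byte order mark is involved. Because every character occupies exactly 4 bytes after conversion, a byte offset divided by 4 is a character offset.

```php
<?php
/**
 * Hypothetical helper: character position of $needle in $haystack
 * (both UTF-8), found by searching fixed-width UTF-32BE copies.
 * Returns false when the needle is not present.
 */
function char_pos_via_utf32($haystack, $needle)
{
    $h = mb_convert_encoding($haystack, 'UTF-32BE', 'UTF-8');
    $n = mb_convert_encoding($needle, 'UTF-32BE', 'UTF-8');

    $from = 0;
    while (($byte_pos = strpos($h, $n, $from)) !== false) {
        // Only accept hits aligned on a 4-byte character boundary.
        if ($byte_pos % 4 === 0) {
            return intdiv($byte_pos, 4);
        }
        $from = $byte_pos + 1;
    }
    return false;
}

// Illustrative input: four 2-byte UTF-8 characters before the tag.
$html = str_repeat("\xC3\xA9", 4) . '<br />';
$pos  = char_pos_via_utf32($html, '<br />'); // 4

// A literal from a UTF-8 source file is 1 byte, its UTF-32 form is 4 bytes,
// so the two only compare equal after conversion.
$amp = mb_convert_encoding('&', 'UTF-32BE', 'UTF-8');
```

Here `$amp` is the four bytes 00 00 00 26, which is why the comparison in the snippet above converts the ampersand before testing `$first_char`.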

score -1 · Answer 3 · answered Mar 30 '12 at 22:03

Have you looked into http://www.php.net/manual/en/function.mb-ereg.php ?

imsky


It doesn't have the options I need. preg_match takes `( string $pattern , string $subject [, array &$matches [, int $flags = 0 [, int $offset = 0 ]]] )`, whereas mb_ereg only takes `( string $pattern , string $string [, array $regs ] )`, so there are no flags or offset. – Dave Stein Mar 30 '12 at 22:10
