PHP regular expression fails when a non-UTF-8 character is found

I need to go through 40,000 database records and extract a width and a height value from a custom_size field in a MySQL table.

The field comes in all sorts of inconsistent formats.

The most reliable way is to grab a numeric value from the left and right side of an x and strip all non numeric values from them.

The code below works well about 99% of the time, but it fails on a few records that contain non-UTF-8 characters.

31*32 and 35”x21” are 2 examples.

When these are run, I get the following PHP errors and the script halts:

Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 1683977065 on line 21

Warning: preg_match(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 on line 24

Demo:

<?php

$strings = array(

    '12x12',
    '172.61 cm x 28.46 cm',
    '31"x21"',
    '1"x1"',
    '31*32',
    '35”x21”'
);


foreach($strings as $string){

    if($string != ''){

        $string = str_replace('”','"',$string);

        // Strip out all characters except for numbers, letter x, and decimal points
        $string = preg_replace( '/([^0-9x\.])/ui', '', strtolower( $string ) );

        // Find anything that fits the number X number format
        preg_match( '/([0-9]+(\.[0-9]+)?)x([0-9]+(\.[0-9]+)?)/ui', $string, $values ); 

        echo 'Original value: ' .$string.'<br>';
        echo 'Width: ' .$values[1].'<br>';
        echo 'Height: ' .$values[3].'<br><hr><br>';         

    }

}

Any ideas on how to get around this? I cannot rebuild the server software to add UTF-8 support.


I just found an answer with a PHP library for converting to UTF-8 that seems to help a lot: https://stackoverflow.com/a/3521396/143030

JasonDavis
  • If your input was not utf-8, why use the `u` flag? Also the pattern does not seem to require it. – Jonny 5 Jul 14 '15 at 05:13
  • @Jonny5: If the input is Unicode text, `u` flag is a must, since it affects how the pattern is interpreted. – nhahtdh Jul 14 '15 at 08:33
  • Related: http://stackoverflow.com/questions/10037336/pcre-is-compiled-without-utf-support By the way, if you found that the other question resolves your problem, you can either close your question as duplicate, or post it as an answer, instead of editing the solution into the question. – nhahtdh Jul 14 '15 at 08:35
  • @nhahtdh He's matching only ASCII characters (`0-9`, `x` and a literal `.`), so there's no difference. For other cases I agree with you. Furthermore, he's using the `strtolower` function, which is not designed for UTF-8 input. That suggests the input is not multibyte; otherwise he would be using `mb_strtolower`. – Jonny 5 Jul 14 '15 at 10:05

1 Answer

By default, the PCRE regex engine reads a string one byte at a time. It therefore ignores the byte sequences that make up a single character in a multibyte encoding like UTF-8, and treats each byte as a separate character (one byte, one character).

For example, the character U+201D: RIGHT DOUBLE QUOTATION MARK uses three bytes in UTF-8:

$a = '”';

for ($i=0; $i < strlen($a); $i++) {
    echo dechex(ord($a[$i])), ' ';
}

Result:

e2 80 9d

To enable multibyte reading in the PCRE regex engine, you can either put one of these directives at the beginning of the pattern: (*UTF), (*UTF8), (*UTF16), (*UTF32), or use the u modifier. The u modifier switches on the available multibyte mode, but it also extends the meaning of the shorthand character classes like \s, \d, \w... to Unicode. In other words, the u modifier is a shortcut for (*UTFx) plus (*UCP), which changes the character classes.

But these features are only available if the PCRE module has been compiled with support for these encodings. (This is the case for most default PHP installations, but it isn't guaranteed.)

That doesn't seem to be the case for you, since using the u modifier gives you this explicit message:

this version of PCRE is not compiled with PCRE_UTF8 support

You can't do anything about that, short of switching to a PHP installation whose PCRE module is compiled with UTF-8 support.
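If you want to check at runtime whether a given server has this support, a minimal sketch could look like the following. The function name pcre_has_utf8_support is hypothetical (it is not a PHP built-in); the check relies only on the fact that preg_match() returns false when the pattern fails to compile.

```php
<?php
// Hypothetical helper (not a PHP built-in): reports whether this PHP
// build's PCRE library accepts the u modifier.
function pcre_has_utf8_support()
{
    // Without UTF-8 support the pattern fails to compile: preg_match()
    // emits a warning (silenced here with @) and returns false instead of 1.
    // "\xE2\x80\x9D" is U+201D written as its three UTF-8 bytes.
    return @preg_match('/^.$/u', "\xE2\x80\x9D") === 1;
}

echo pcre_has_utf8_support()
    ? "u modifier available\n"
    : "u modifier not available; use byte-oriented patterns\n";
```

On a server like yours this should take the second branch, which lets a script fall back to patterns that do not need the u modifier.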

However, that isn't really a problem in your case, because in your patterns the u modifier is useless, even if your input is UTF-8 encoded.

The reason is that your two patterns use only ASCII literal characters (characters in the 00-7F range), and characters beyond the ASCII range never use bytes from this range in their UTF-8 encoding:

Unicode  char   UTF8    Name
--------------------------------------------------------
U+007D     }       7d   RIGHT CURLY BRACKET
U+007E     ~       7e   TILDE
U+007F             7f   <control>
U+0080          c2 80   <control>
U+0081          c2 81   <control>
...
U+00BE     ¾    c2 be   VULGAR FRACTION THREE QUARTERS
U+00BF     ¿    c2 bf   INVERTED QUESTION MARK
U+00C0     À    c3 80   LATIN CAPITAL LETTER A WITH GRAVE
U+00C1     Á    c3 81   LATIN CAPITAL LETTER A WITH ACUTE
...
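The property shown in the table can be checked with a few lines of PHP. This is only an illustrative sketch: contains_ascii_byte is a hypothetical helper, and the sample characters are written as \x escapes so the result does not depend on the encoding of the script file.

```php
<?php
// Two characters from the table plus the curly quote from the question,
// spelled out as their UTF-8 bytes.
$samples = array(
    'U+00BE' => "\xC2\xBE",
    'U+00C0' => "\xC3\x80",
    'U+201D' => "\xE2\x80\x9D",
);

// Hypothetical helper: true if any byte of $s is in the ASCII range 00-7F.
function contains_ascii_byte($s)
{
    for ($i = 0; $i < strlen($s); $i++) {
        if (ord($s[$i]) < 0x80) {
            return true;
        }
    }
    return false;
}

foreach ($samples as $codepoint => $bytes) {
    echo $codepoint, ': ', bin2hex($bytes), ' ',
        contains_ascii_byte($bytes) ? 'has an ASCII byte' : 'no ASCII bytes', "\n";
}
```

Since none of these byte sequences contains a digit, an x or a dot, a byte-oriented character class like [^0-9x.] removes every byte of a multibyte character without touching the ASCII characters you want to keep.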

So you can write:

$string = preg_replace( '/[^0-9x.]+/', '', strtolower( $string ) );

(There is no need for the i modifier since your string is already lowercase, and no need to escape a dot inside a character class or to use a capture group. Adding the + quantifier speeds up the replacement, since a run of consecutive characters is removed in one replacement instead of one by one.)

and:

if (preg_match('/([0-9]+(?:\.[0-9]+)?)x([0-9]+(?:\.[0-9]+)?)/', $string, $values)) {
    echo 'Original value: ', $string, '<br>';
    echo 'Width: ', $values[1], '<br>';
    echo 'Height: ', $values[2], '<br><hr><br>';
}
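Put together, the two patterns can be wrapped in a small function and run against the sample values from the question. This is a sketch, not a drop-in solution: parse_size is a hypothetical name, and the curly quote is written as its UTF-8 bytes (\xE2\x80\x9D) on the assumption that the stored data is UTF-8.

```php
<?php
// Hypothetical helper: returns array(width, height) as strings, or null
// when no "number x number" pair is found. No u modifier is used.
function parse_size($raw)
{
    // Keep only digits, the letter x and dots; everything else (including
    // each byte of a multibyte character) is removed.
    $clean = preg_replace('/[^0-9x.]+/', '', strtolower($raw));

    if (preg_match('/([0-9]+(?:\.[0-9]+)?)x([0-9]+(?:\.[0-9]+)?)/', $clean, $m)) {
        return array($m[1], $m[2]);
    }
    return null;
}

$strings = array(
    '12x12',
    '172.61 cm x 28.46 cm',
    '31"x21"',
    "35\xE2\x80\x9Dx21\xE2\x80\x9D", // 35”x21” with UTF-8 curly quotes
    '31*32',
);

foreach ($strings as $string) {
    $size = parse_size($string);
    echo $string, ' => ', $size ? $size[0] . ' by ' . $size[1] : 'no match', "\n";
}
```

Note that 31*32 gives no match here, because it contains no x; a value like that would need its own rule if * should also count as a separator.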

However, leaving out the u modifier can be dangerous with some patterns. For example, the following does not remove the first character as expected when that character is encoded with several bytes; it removes only its first byte:

$a = preg_replace('/^./', '', '”abc');

for ($i=0; $i < strlen($a); $i++) {
    echo ' ', dechex(ord($a[$i]));
}

returns:

 80 9d 61 62 63
# �  �  a  b  c
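If you ever need that kind of pattern on a server without UTF-8 support, one possible workaround is to describe a UTF-8 character byte by byte. The sketch below uses a hypothetical helper, strip_first_utf8_char, and assumes the input is well-formed UTF-8; it does not validate the sequence.

```php
<?php
// Hypothetical helper: removes the first character of a UTF-8 string
// without the u modifier. Assumes the input is well-formed UTF-8.
function strip_first_utf8_char($s)
{
    // Either one ASCII byte (00-7F), or a lead byte (C0-FF) followed by
    // its continuation bytes (80-BF).
    return preg_replace('/^(?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]*)/', '', $s);
}

// "\xE2\x80\x9D" is the curly quote U+201D as UTF-8 bytes.
echo bin2hex(strip_first_utf8_char("\xE2\x80\x9Dabc")), "\n"; // 616263, i.e. "abc"
```

Here all three bytes of the quote are removed together, so the remaining string is still valid UTF-8.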
Casimir et Hippolyte