
I am working with a database of regular expressions, some of which contain "\uXXXX" escape sequences that PHP's PCRE functions reject.

So, a two-part question. First, is there a way to tell PCRE to accept those sequences?

Second, I worked around the issue (luckily only one sequence was involved) by doing:

$regx = str_ireplace('\u00a7', '\xa7', $regx);

but when I was attempting to do:

$regx = preg_replace("/\\u(\w+)/i", "\x$1", $regx);

I was still getting:

Warning: preg_replace() [function.preg-replace]: Compilation failed: PCRE does not support \L, \l, \N{name}, \U, or \u at offset 1

It took double-escaping the \u as \\\\u, not simply \\u, to make it work. Why is that, and is there a better way? Note: I had to do the same thing, and more, to get the correct string to show up in this post.

Update: we are running PHP 5.3.3 on our server.

Wayne Weibel
  • You can add a `u` modifier after the regex (i.e. alongside your `i` modifier) to specify that the expression is in UTF-8. See http://php.net/manual/en/reference.pcre.pattern.modifiers.php – SDC Feb 04 '13 at 15:10
  • I did attempt that, but still received the error. The 'u' modifier would have allowed me to have § in the regex pattern instead of the sequence. What you posted below seems to be the reason for the error still occurring though. – Wayne Weibel Feb 04 '13 at 20:23

2 Answers

$regx = preg_replace("/\\u(\w+)/i", "\x$1", $regx);

This doesn't work because you need to double-escape your backslashes.

As things stand, \\u is inside a PHP double-quoted string, which means that PHP reduces the \\ to a single backslash.

That single backslash is then given to PCRE, so the regex parser just sees \u. This fails because \u is not an escape sequence that PCRE supports.

If you want to match a literal backslash character in a PHP regex, you need to supply four backslashes.

$regx = preg_replace("/\\\\u(\w+)/i", "\x$1", $regx);

Yep. It's ugly. But that's how it is.

Technically, this applies to any regex backslash, so in theory your \w should have a double backslash too. You can get away without it there, and in most other cases, because \w has no meaning to PHP, so PHP leaves it untouched. This is helpful behaviour, but it makes things more confusing when it goes wrong, as in this case.
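To see the two layers of escaping at work, here is a small sketch that prints what PHP actually hands to PCRE for each spelling. The variable names are made up for illustration.

```php
<?php
// What PHP passes to PCRE after its own string escaping.
$twoBackslashes  = "/\\u(\w+)/i";    // PCRE receives: /\u(\w+)/i   (unsupported \u)
$fourBackslashes = "/\\\\u(\w+)/i";  // PCRE receives: /\\u(\w+)/i  (literal backslash, then u)

echo $twoBackslashes, "\n";
echo $fourBackslashes, "\n";

// Single quotes do not avoid the doubling: \\ still collapses to one backslash.
$singleQuoted = '/\\\\u(\w+)/i';

// The subject is single-quoted, so it holds a literal backslash followed by u00a7.
$matched = preg_match($fourBackslashes, '\u00a7', $m);
```

Here $matched is 1 and $m[1] holds "00a7", while passing $twoBackslashes to preg_match() would trigger the compilation warning from the question.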

SDC
  • so php is trying to be helpful and escape my backslashes before evaluating the regex? the double becomes a single and then the invalid escape is encountered ... – Wayne Weibel Feb 04 '13 at 20:25
  • hmm, it's not so much that it's trying to be helpful; it's just how things work because it's inside a string. The string escaping is processed by PHP as per any other string, before the expression is passed over to the PCRE engine. It's just unfortunate that PHP strings and PCRE regex both use backslash as their escape character. You'll get a similar effect if you're working with javascript strings inside a PHP string, or anything else that uses the same backslash escape char. The slashes multiply up for each language that needs to process them. – SDC Feb 05 '13 at 10:32

\u won't work in PHP, but \x will. From the PCRE documentation:

\x{hhh..} character with hex code hhh.. (non-JavaScript mode)
\uhhhh    character with hex code hhhh (JavaScript mode only)

Don't forget the u modifier. ("JavaScript mode" is an "internal" PCRE flag.)
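As a rough sketch of how the stored patterns could be converted, the snippet below rewrites each \uXXXX as \x{XXXX} and appends the u modifier. The function name convert_unicode_escapes and the sample pattern are made up for illustration, and the code assumes every \u in the pattern is meant as an escape, which would not hold for a pattern containing an escaped backslash followed by a "u". The braces matter: without them \x reads at most two hex digits, so \x00a7 would mean a NUL byte followed by "a7".

```php
<?php
// Hypothetical helper: rewrite JavaScript-style \uXXXX escapes as PCRE \x{XXXX}.
// Assumes every \u in the pattern is an escape, not a literal backslash
// (written \\) that happens to be followed by "u".
function convert_unicode_escapes($pattern)
{
    return preg_replace_callback(
        '/\\\\u([0-9a-fA-F]{4})/',
        function ($m) {
            return '\\x{' . $m[1] . '}';
        },
        $pattern
    );
}

// A made-up pattern as it might come out of the database, delimiters included.
$stored    = '/\u00a7\s*\d+/';
$converted = convert_unicode_escapes($stored) . 'u'; // append the u modifier

echo $converted, "\n";

// "\xC2\xA7" is the UTF-8 encoding of the section sign.
$found = preg_match($converted, "\xC2\xA7 12");
```

A callback is used instead of a replacement string so that no extra layer of backslash and $1 escaping has to be reasoned about.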

Another way to interpret Unicode escape sequences (both \u and \U) is the intl extension's Transliterator (PHP >= 5.4):

$in = '\u0041\U00000062';
$out = transliterator_create('Hex-Any')->transliterate($in);
var_dump($out); # string(2) "Ab"
julp