Matching Unicode letter characters in PCRE/PHP

Question

I'm trying to write a reasonably permissive validator for names in PHP, and my first attempt consists of the following pattern:

// unicode letters, apostrophe, hyphen, space
$namePattern = "/^([\\p{L}'\\- ])+$/";

This is eventually passed to a call to preg_match(). As far as I can tell, this works with your vanilla ASCII alphabet, but seems to trip up on spicier characters like Ă or 张.

Is there something wrong with the pattern itself? Perhaps I'm expecting \p{L} to do more work than I think it does?

Or does it have something to do with the way input is being passed in? I'm not sure if it's relevant, but I did make sure to specify UTF-8 encoding on the form page.

score 33 · Accepted Answer · answered Feb 13 '11 at 09:38

I think the problem is much simpler than that: You forgot to specify the u modifier. The Unicode character properties are only available in UTF-8 mode.

Your regex should be:

// unicode letters, apostrophe, hyphen, space
$namePattern = '/^[-\' \p{L}]+$/u';
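A quick way to check the corrected pattern is sketched below. The sample names are illustrative assumptions, not taken from the question, and the script assumes the file itself is saved as UTF-8.

```php
<?php
// Hypothetical sample inputs; save this file as UTF-8.
$namePattern = '/^[-\' \p{L}]+$/u';

$samples = ["O'Brien", 'Anne-Marie', 'Ă', '张', 'R2-D2'];

foreach ($samples as $name) {
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    $result = preg_match($namePattern, $name);
    echo $name, ' => ', var_export($result, true), PHP_EOL;
}

// strlen() counts bytes, not characters: 张 occupies three bytes in UTF-8.
// Without /u the regex engine sees those bytes individually.
echo strlen('张'), PHP_EOL;
```

With the u modifier the first four samples should match, while 'R2-D2' is rejected because digits are not in the character class.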

NikiC


Weird. Try `$namePattern = '/^[\pL]$/'; $a = '张'; var_dump(preg_match($namePattern, $a)); $a = '张1'; var_dump(preg_match($namePattern, $a));` and variants. It does work for me without u. PHP 5.3.2-1ubuntu4.7 – chx Feb 13 '11 at 09:42
@chx: Gives me `int(0)`. Your file probably isn't encoded using UTF-8. – NikiC Feb 13 '11 at 09:47
It is, I have used the utf8ToUnicode routine shown in my answer to verify the codepoints in my file. – chx Feb 13 '11 at 09:59
@chx: In that case, I don't know. It doesn't work for me without the modifier, but it works with it. As PHP doesn't have proper multibyte support, issues with encoding are common. – NikiC Feb 13 '11 at 10:05
Yep, as simple as adding the `u`. Follow-up question: what does the `u` indicate, precisely? Since my pattern without the `u` still matched ASCII, I'm guessing it tells the regex something about the nature of the input string, rather than the pattern itself. – Jeff Lee Feb 13 '11 at 18:43
@Jeff Lee: It indicates that the string should be handled as a UTF-8 string. I.e., a UTF-8 character may consist of several bytes. Normally PCRE would match every single byte against your regex, but in UTF-8 mode it combines those bytes into characters and matches those ;) – NikiC Feb 13 '11 at 18:48
+1, for completeness, another way to turn on Unicode properties in this pattern: `(*UTF)(*UCP)^[-\' \p{L}]+$` (see [PCRE's Special Start-of-Pattern Modifiers](http://www.rexegg.com/regex-modifiers.html#pcre)) – zx81 Aug 10 '14 at 06:38

score 1 · Answer 2 · edited May 23 '17 at 12:01

If you want to replace an old pattern with a new one in Unicode text, you should write:

$text = preg_replace('/\bold pattern\b/u', 'new pattern', $text);

So the key here is the u modifier.

Note: your server's PHP version should be at least PHP 4.3.5,

as mentioned here: php.net | Pattern Modifiers

u (PCRE_UTF8) This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.

Thanks to AgreeOrNot, who gave me that key here: preg_replace match whole word in arabic

I tried it and it worked on localhost, but when I tried it on the remote server it didn't work. Then I found on php.net that the u modifier requires at least PHP 4.3.5, so I upgraded the PHP version and it worked.

It's important to know that this method is very helpful for Arabic users (عربي) because, as I believe, Unicode is the best encoding for the Arabic language, and the replacement will not work if you don't use the u modifier. See the next example; it should work for you:

$text = preg_replace('/\bمرحبا بك\b/u', 'NEW', $text);
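A self-contained version of that replacement might look like the sketch below. The input sentence is a made-up example, and the script assumes the file is saved as UTF-8 and that PHP's PCRE build has Unicode support.

```php
<?php
// Hypothetical input text that starts with the phrase to replace.
$text = 'مرحبا بك يا صديقي';

// With /u the subject is read as UTF-8 and \b can see Arabic letters
// as word characters, so whole-word matching works.
$result = preg_replace('/\bمرحبا بك\b/u', 'NEW', $text);

echo $result, PHP_EOL;

// preg_replace() returns null on error (for example, invalid UTF-8 input),
// so it is worth checking before using the result.
if ($result === null) {
    echo 'preg error code: ', preg_last_error(), PHP_EOL;
}
```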

chx · Answer 3 · 2011-02-13T09:41:11.753

First of all, your life would be a lot easier if you'd use single apostrophes instead of double quotes when writing these -- you need only one backslash. Second, combining marks \pM should also be included. If you find a character not matched please find out its Unicode code point and then you can use http://www.fileformat.info/info/unicode/ to figure out where it is. I found http://hsivonen.iki.fi/php-utf8/ an invaluable tool when doing debugging with UTF-8 properties (don't forget to convert to hex before trying to look up: array_map('dechex', utf8ToUnicode($text))).

For example, Ă turns out to be http://www.fileformat.info/info/unicode/char/0102/index.htm and to be in Lu and so L should match it and it does match for me. The other character is http://www.fileformat.info/info/unicode/char/5f20/index.htm and is also isLetter and indeed matches for me. Do you have the Unicode character tables compiled in?
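The two suggestions above (allowing combining marks and looking up code points) can be sketched as follows. This is an illustration under assumptions: it uses `mb_ord()` from the mbstring extension (PHP 7.2+) in place of the `utf8ToUnicode` routine from the linked library, and the `"\u{...}"` escape needs PHP 7.0 or later.

```php
<?php
// Decomposed "é": the letter e followed by U+0301 COMBINING ACUTE ACCENT.
$decomposed = "e\u{0301}";

$lettersOnly  = '/^[-\' \p{L}]+$/u';
$lettersMarks = '/^[-\' \p{L}\p{M}]+$/u';

// U+0301 is a mark (category Mn), not a letter, so \p{L} alone rejects it.
var_dump(preg_match($lettersOnly, $decomposed));   // int(0)
var_dump(preg_match($lettersMarks, $decomposed));  // int(1)

// Split a string into characters and print each code point in hex,
// ready to look up in a Unicode chart.
$chars = preg_split('//u', 'Ă张', -1, PREG_SPLIT_NO_EMPTY);
$hex = array_map(function ($c) {
    return dechex(mb_ord($c, 'UTF-8'));
}, $chars);

print_r($hex);  // 102 and 5f20
```

Input that has been normalized to composed form would already pass the letters-only pattern; adding `\p{M}` matters when the text arrives decomposed.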

score 0 · Answer 4 · answered Jul 31 '20 at 07:12

Anyone else looking here and not getting this to work, please note that /u will not produce consistent results with Unicode scripts across different PHP versions.

See example: https://3v4l.org/4hB9e

Related: Incosistent regex result for Thai characters across different PHP version

score -2 · Answer 5 · answered Sep 28 '20 at 00:58

<?php preg_match('/[a-zığüşöç]/u',$title)  ?>

Jack Nal

