17

I need help with regular expressions. My string contains Unicode characters and the code below doesn't work.

The first four characters must be digits, followed by a comma and then any alphabetic characters or whitespace. I read that adding /u to the end of the regular expression should help, but it didn't work for me.

My code works with non-Unicode characters:

$post = '9999,škofja loka';
echo preg_match('/^[0-9]{4},[\s]*[a-zA-Z]+', $post);

Thanks for your answers!

Mark Fox
Gasper

4 Answers

35

Updated answer:
This is now tested and working

$post = '9999, škofja loka';
echo preg_match('/^\\d{4},[\\s\\p{L}]+$/u', $post);

\w is not suitable here: in addition to letters it also matches [0-9_], and in my test it did not match the non-ASCII letters.

The u modifier is also important, because it activates Unicode mode.

If letters and whitespace can appear in any order after the comma, put them into the same character class. Your regex allows zero or more whitespace characters directly after the comma and then only letters, so it cannot get past the space inside "škofja loka".

See http://www.regular-expressions.info/php.html for PHP regex details.

\p{L} is the Unicode property that matches any letter.

The end-of-string anchor $ is also important: it ensures that the complete string is verified. Without it, the pattern succeeds as soon as the first character after the comma matches (a single whitespace, for example) and ignores the rest.
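To illustrate the effect of the shared character class and the $ anchor, here is a minimal sketch. The helper name `isValidPost` and the third sample string are hypothetical, and the file is assumed to be saved as UTF-8 and run on PHP 7 or later.

```php
<?php
// Hypothetical helper wrapping the pattern from this answer.
function isValidPost(string $post): bool
{
    // Four digits, a comma, then only letters or whitespace up to the end.
    return preg_match('/^\d{4},[\s\p{L}]+$/u', $post) === 1;
}

var_dump(isValidPost('9999,škofja loka'));      // bool(true)
var_dump(isValidPost('9999,škofja loka,.(?*')); // bool(false): trailing punctuation
var_dump(isValidPost('999,škofja loka'));       // bool(false): only three digits
```

The pattern is written with single backslashes here; inside a single-quoted PHP string both `\d` and `\\d` reach the regex engine as `\d`.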

stema
  • Doesn't work (returns 0): `$post = '9999,škofja loka'; echo preg_match('/^[0-9]{4},[\s\w]+/u', $post);` – Gasper Jun 20 '11 at 07:54
  • @gašper, so now I tested it [online](http://writecodeonline.com/php/) and it seems that PHP needs to be double escaped `preg_match('/^\\d{4},[\\s\\w]+$/u', $post);` but it seems that `\\w` does not include the unicode characters, even with `u` modifier. – stema Jun 20 '11 at 08:14
  • @gašper, I did some more testing and updated my answer – stema Jun 20 '11 at 08:16
  • Can I also use that regular expression in JS? – Gasper Jun 20 '11 at 08:23
  • @gašper, I don't think so. [http://www.regular-expressions.info/javascript.html](http://www.regular-expressions.info/javascript.html) explains the JavaScript regex flavour and says that it does not support Unicode (unless you give the character explicitly, like `^\d{4},[\sa-zA-Zš]+$`) – stema Jun 20 '11 at 08:27
  • There is a library for Unicode in JS and much more: http://xregexp.com/ – llamerr Apr 20 '12 at 16:37
8

[a-zA-Z] will match only letters in the range of a-z and A-Z. You have non-US-ASCII letters, and therefore your regex won't match, regardless of the /u modifier. You need to use the word character escape sequence (\w).

$post = '9999,škofja loka';
echo preg_match('/^[0-9]{4},[\s]*[\w]+/u', $post);
jmz
7

The problem is your regular expression. You are explicitly saying that you will only accept a b c ... z A B C ... Z. š is not in the a-z set. Remember, š is as different to s as any other character.

So if you really just want a sequence of letters, you need to test for the Unicode properties, e.g.

echo preg_match('/^[0-9]{4},[\s]*\p{L}+/u', $post);

That should work because \p{L} matches any Unicode character that is considered a letter, not just A through Z.
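As a rough side-by-side sketch of the difference, assuming the file is saved as UTF-8 (the variable names `$asciiOnly` and `$anyLetter` are only illustrative):

```php
<?php
// Sample input from the question; this file is assumed to be saved as UTF-8.
$post = '9999,škofja loka';

// ASCII-only class: fails because š is outside a-z and A-Z.
$asciiOnly = preg_match('/^[0-9]{4},\s*[a-zA-Z]+/u', $post);

// Unicode letter property: š counts as a letter, so the match succeeds.
$anyLetter = preg_match('/^[0-9]{4},\s*\p{L}+/u', $post);

var_dump($asciiOnly, $anyLetter); // int(0) and int(1)
```

Neither pattern is anchored with $, so the second one only confirms that at least one letter follows the comma; anything after that is not checked.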

Sodved
  • This doesn't work right: it should return 0 but it returns 1: `$post = '9999,ščćžđkofja loka,.(?*'; echo preg_match('/^[0-9]{4},[\s]*\p{L}+/', $post);` – Gasper Jun 20 '11 at 07:51
  • One thing: in your test program, is the $post string in UTF-8? Sorry, I'm not that good at PHP. But in Perl, if you just enter the character `š` you get a string of one byte, 9A. In UTF-8 that character needs to be two bytes, C5 A1 (which looks like `Å¡` in a Latin character encoding). – Sodved Jun 20 '11 at 08:14
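The encoding point in the last comment can be checked with a short sketch. The bytes are written as explicit escapes so the result does not depend on how the file is saved; the variable names `$utf8` and `$legacy` are illustrative only.

```php
<?php
// "\xC5\xA1" is the UTF-8 encoding of š (U+0161).
$utf8 = "\xC5\xA1";
// "\x9A" is the single byte mentioned in the comment for a one-byte encoding.
$legacy = "\x9A";

var_dump(bin2hex($utf8));                        // string(4) "c5a1"
var_dump(preg_match('/^\p{L}$/u', $utf8) === 1); // bool(true)

// With the u modifier the subject must be valid UTF-8, so this does not match.
var_dump(preg_match('/^\p{L}$/u', $legacy) === 1);   // bool(false)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)
```

If `preg_last_error()` reports `PREG_BAD_UTF8_ERROR` for real input, the string is probably not UTF-8 encoded and needs converting before a /u pattern can match it.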
0

Add the u modifier, and remember the closing delimiter (the trailing slash):

echo preg_match('/^[0-9]{4},[\s]*[a-zA-Z]+/u', $post);

Edited:

echo preg_match('/^\d{4},(?:\s|\w)+/u', $post);
searlea