180

I am new to regular expressions and have been given the following regular expression:

(\p{L}|\p{N}|_|-|\.)*

I know what * means and | means "or" and that \ escapes.

But I don't know what \p{L} and \p{N} mean. I searched Google for them without success...

Can someone help me?

Tim Pietzcker
Diemauerdk

2 Answers

263

\p{L} matches a single code point in the category "letter".
\p{N} matches any kind of numeric character in any script.

Source: regular-expressions.info
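
To make the two definitions above concrete, here is a minimal Java sketch (Java is one of the flavors that supports `\p{…}`). The class name `UnicodePropertyDemo` and the helper methods are made up for illustration, and the sample characters are written as `\u` escapes so the file does not depend on the source encoding. It also checks the pattern from the question against `10`.

```java
import java.util.regex.Pattern;

public class UnicodePropertyDemo {
    // \p{L}: a single code point in the Unicode "letter" category.
    static final Pattern LETTER = Pattern.compile("\\p{L}");
    // \p{N}: a single code point in the Unicode "number" category.
    static final Pattern NUMBER = Pattern.compile("\\p{N}");
    // The pattern from the question, unchanged.
    static final Pattern ORIGINAL = Pattern.compile("(\\p{L}|\\p{N}|_|-|\\.)*");

    static boolean isLetter(String s) { return LETTER.matcher(s).matches(); }
    static boolean isNumber(String s) { return NUMBER.matcher(s).matches(); }
    static boolean matchesOriginal(String s) { return ORIGINAL.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(isLetter("a"));       // ASCII letter: true
        System.out.println(isLetter("\u00e9"));  // e-acute (U+00E9): true
        System.out.println(isLetter("\u044f"));  // Cyrillic ya (U+044F): true
        System.out.println(isLetter("7"));       // a digit is not a letter: false

        System.out.println(isNumber("7"));       // ASCII digit: true
        System.out.println(isNumber("\u0663"));  // Arabic-Indic digit three (U+0663): true
        System.out.println(isNumber("\u00bd"));  // vulgar fraction one half (U+00BD): true

        System.out.println(matchesOriginal("10"));                       // true
        System.out.println(matchesOriginal("caf\u00e9_2.0-\u03b2"));     // true
        System.out.println(matchesOriginal("a b"));                      // space is not allowed: false
    }
}
```

Note that `matches()` requires the whole input to match, which is usually what you want when validating a name or identifier with a pattern like this one.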

If you're going to work with regular expressions a lot, I'd suggest bookmarking that site, it's very useful.

Cerbrus
  • Thanks for the fast answer :). But shouldn't the regex then match 10? I have tried an online regex matcher: http://regexpal.com/ – Diemauerdk Feb 15 '13 at 09:10
  • @user1093774: I don't think [regexpal](http://regexpal.com/) supports `\p{}`, but yes, it should match. – Cerbrus Feb 15 '13 at 09:12
  • 1
    This syntax is specific for modern Unicode regex implementation, which not all interpreters recognize. You can safely replace \p{L} by {a-zA-Z} (ascii notation) or {\w} (perl/vim notation); and \p{N} by {0-9} (ascii) or {\d} (perl/vim). If you want to match all of them, just do: {a-zA-Z0-9}+ or {\w\d}+ – Rafael Beckel Aug 18 '15 at 02:46
  • 46
    Rafael, I don't agree that you can safely replace `\p{L}` by `{a-zA-Z}`. `{a-zA-Z}`, for example, will not match any accented character, such as `é`, which is used all over in French. So these are only safely replaceable if you are sure that you will only be processing English, and nothing else. – Rolf Nov 08 '17 at 12:19
  • Does it match code point or code unit? https://stackoverflow.com/a/27331885/4928642 – Qwertiy Nov 07 '18 at 15:39
  • Note: if doing a regex like this in a browser, you need to pass the `u` flag. https://stackoverflow.com/a/52205643/329062 – Greg Mar 02 '21 at 19:08
50

These are Unicode property shortcuts (\p{L} for Unicode letters, \p{N} for Unicode numeric characters). They are supported by .NET, Perl, Java, PCRE, XML, XPath, JGSoft, Ruby (1.9 and higher) and PHP (since 5.1.0).

At any rate, that's a very strange regex. You should not be using alternation when a character class would suffice:

[\p{L}\p{N}_.-]*
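
As a rough check of that rewrite, the following Java sketch compares the alternation from the question with the character-class version on a few sample strings. The class name `CharClassDemo`, the helper `sameVerdict`, and the sample inputs are illustrative assumptions, not part of the original XML. The one difference it shows is that the original pattern carries a capturing group and the character class does not.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // Pattern from the question: alternation inside a capturing group.
    static final Pattern ALTERNATION = Pattern.compile("(\\p{L}|\\p{N}|_|-|\\.)*");
    // Suggested rewrite: one character class; the trailing '-' and the '.' are literals here.
    static final Pattern CHAR_CLASS = Pattern.compile("[\\p{L}\\p{N}_.-]*");

    // True when both patterns agree on whether the whole string matches.
    static boolean sameVerdict(String s) {
        return ALTERNATION.matcher(s).matches() == CHAR_CLASS.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "", "10", "file_name-1.txt", "a b", "x+y" };
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> class matches: "
                    + CHAR_CLASS.matcher(s).matches()
                    + ", same verdict: " + sameVerdict(s));
        }
        // The alternation version has one capturing group, the class version has none.
        System.out.println(ALTERNATION.matcher("").groupCount()); // 1
        System.out.println(CHAR_CLASS.matcher("").groupCount());  // 0
    }
}
```

Both patterns accept the empty string because of the `*` quantifier; use `+` instead if at least one character is required.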
gondo
Tim Pietzcker
  • It's a regex in XML - I have not constructed the regex myself :) – Diemauerdk Feb 15 '13 at 09:13
  • Apart from the fact that capturing parentheses were used, the REs will actually compile to the same thing (well, in any optimizing RE engine that supports the `\p{…}` escape sequence style in the first place). – Donal Fellows Feb 15 '13 at 09:34
  • That looks like the XRegExp Unicode plugin, in which case it would match any alphanumeric character in any language. – Tim Oct 30 '15 at 19:10
  • Thanks, listing supporting languages was useful, unaware there were limitations there (most regex'y things being "universal"). – HoldOffHunger Jul 19 '18 at 20:42
  • @HoldOffHunger: Far from it, unfortunately. That's why there is a market for tools like RegexBuddy. Take a look at https://www.regular-expressions.info/refbasic.html, you'll be amazed at the subtle and not-so-subtle differences between regex flavors... – Tim Pietzcker Jul 20 '18 at 06:23
  • @TimPietzcker According to www.regular-expressions.info `The PHP preg functions ... support Unicode when the /u option is appended to the regular expression.` Therefore `/u` would need to be at the end, would it not? – davidhartman00 Jan 09 '20 at 17:07