Regular expression \p{L} and \p{N}

Question

I am new to regular expressions and have been given the following regular expression:

(\p{L}|\p{N}|_|-|\.)*

I know what * means and | means "or" and that \ escapes.

But I don't know what \p{L} and \p{N} mean. I have searched Google for them, without result...

Can someone help me?

Cerbrus · Accepted Answer · 2016-10-13T08:42:16.713

263

\p{L} matches a single code point in the category "letter".
\p{N} matches any kind of numeric character in any script.

Source: regular-expressions.info

If you're going to work with regular expressions a lot, I'd suggest bookmarking that site; it's very useful.
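To make the two categories concrete, here is a minimal sketch in Python. It does not use a regex engine; it looks up each character's Unicode general category with the standard-library `unicodedata` module, on the assumption that `\p{L}` corresponds to categories starting with `L` and `\p{N}` to categories starting with `N`. The helper names `is_letter` and `is_number` are made up for this illustration.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    # Rough equivalent of \p{L}: general category Lu, Ll, Lt, Lm or Lo.
    return unicodedata.category(ch).startswith("L")


def is_number(ch: str) -> bool:
    # Rough equivalent of \p{N}: general category Nd, Nl or No.
    return unicodedata.category(ch).startswith("N")


# Latin, accented Latin, Cyrillic, CJK, ASCII digit, Arabic-Indic digit,
# Roman numeral eight, vulgar fraction one half, underscore, hyphen-minus.
samples = ["a", "\u00e9", "\u0416", "\u4e2d", "7", "\u0663", "\u2167", "\u00bd", "_", "-"]

for ch in samples:
    print(repr(ch), unicodedata.category(ch), is_letter(ch), is_number(ch))
```

The underscore and the hyphen fall into punctuation categories, which is presumably why the regex in the question lists them separately.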

edited Oct 13 '16 at 08:42

answered Feb 15 '13 at 09:03

Cerbrus


thx for the fast answer :). But shouldn't the regex then match 10? I have tried an online regex matcher: http://regexpal.com/ – Diemauerdk Feb 15 '13 at 09:10
@user1093774: I don't think [regexpal](http://regexpal.com/) supports `\p{}`, but yes, it should match. – Cerbrus Feb 15 '13 at 09:12
1

This syntax is specific for modern Unicode regex implementation, which not all interpreters recognize. You can safely replace \p{L} by {a-zA-Z} (ascii notation) or {\w} (perl/vim notation); and \p{N} by {0-9} (ascii) or {\d} (perl/vim). If you want to match all of them, just do: {a-zA-Z0-9}+ or {\w\d}+ – Rafael Beckel Aug 18 '15 at 02:46
46

Rafael, I don't agree that you can safely replace `\p{L}` by `{a-zA-Z}`. `{a-zA-Z}`, for example, will not match any accented character, such as `é`, which is used all over French. So these are only safely replaceable if you are sure that you will only be processing English, and nothing else. – Rolf Nov 08 '17 at 12:19
Does it match code point or code unit? https://stackoverflow.com/a/27331885/4928642 – Qwertiy Nov 07 '18 at 15:39
Note: if doing a regex like this in a browser, you need to pass the `u` flag. https://stackoverflow.com/a/52205643/329062 – Greg Mar 02 '21 at 19:08
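The objection above about accented characters can be checked with a short Python sketch. It assumes the ASCII replacement is written as the character class `[a-zA-Z]`, and it uses `unicodedata` only to show that `é` is nevertheless a letter by Unicode category; the variable names are illustrative.

```python
import re
import unicodedata

# ASCII-only stand-in for \p{L}, written as a character class.
ascii_letters = re.compile(r"[a-zA-Z]+")

word = "caf\u00e9"  # "café", ending in U+00E9

# The ASCII class accepts "cafe" but rejects the accented spelling.
print(ascii_letters.fullmatch("cafe") is not None)
print(ascii_letters.fullmatch(word) is not None)

# U+00E9 is still a letter as far as Unicode is concerned.
print(unicodedata.category("\u00e9"))
```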

score 50 · Answer 2 · edited Feb 10 '16 at 20:29


These are Unicode property shortcuts (\p{L} for Unicode letters, \p{N} for Unicode numeric characters of any kind, not only decimal digits). They are supported by .NET, Perl, Java, PCRE, XML, XPath, JGSoft, Ruby (1.9 and higher) and PHP (since 5.1.0).

At any rate, that's a very strange regex. You should not be using alternation when a character class would suffice:

[\p{L}\p{N}_.-]*
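As a rough check of that rewrite, the Python sketch below compares the two styles. Because it uses only the built-in `re` module, the ASCII ranges `[a-z]` and `[0-9]` stand in for `\p{L}` and `\p{N}`; that substitution is an assumption made to keep the example self-contained, not an equivalent of the original pattern. The function name `same_verdict` is hypothetical.

```python
import re

# Alternation style, as in the question (ASCII stand-ins for \p{L} and \p{N}).
alternation = re.compile(r"([a-z]|[0-9]|_|-|\.)*")

# Character-class style, as suggested in this answer.
char_class = re.compile(r"[a-z0-9_.-]*")


def same_verdict(text: str) -> bool:
    # True when both patterns agree on whether the whole string matches.
    return (alternation.fullmatch(text) is None) == (char_class.fullmatch(text) is None)


samples = ["file-name_1.txt", "", "abc", "a b", "x!", "--..__"]
for text in samples:
    print(repr(text), char_class.fullmatch(text) is not None, same_verdict(text))

# One visible difference: the parentheses capture, and the group holds
# only the last repetition.
print(alternation.fullmatch("ab1").group(1))
```

The character class has no capturing group, so if nothing reads that group the two forms accept the same strings here.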

edited Feb 10 '16 at 20:29

gondo


answered Feb 15 '13 at 09:06

Tim Pietzcker


It's a regex in XML - I have not constructed the regex myself :) – Diemauerdk Feb 15 '13 at 09:13
Apart from the fact that capturing parentheses were used, the REs will actually compile to the same thing (well, in any optimizing RE engine that supports the `\p{…}` escape sequence style in the first place). – Donal Fellows Feb 15 '13 at 09:34
that looks like XRegExp unicode plugin. which if so, would be any alpha-numeric in any language – Tim Oct 30 '15 at 19:10
Thanks, listing supporting languages was useful, unaware there were limitations there (most regex'y things being "universal"). – HoldOffHunger Jul 19 '18 at 20:42
@HoldOffHunger: Far from it, unfortunately. That's why there is a market for tools like RegexBuddy. Take a look at https://www.regular-expressions.info/refbasic.html, you'll be amazed at the subtle and not-so-subtle differences between regex flavors... – Tim Pietzcker Jul 20 '18 at 06:23
@TimPietzcker According to www.regular-expressions.info `The PHP preg functions ... support Unicode when the /u option is appended to the regular expression.` Therefore `/u` would need to be at the end, would it not? – davidhartman00 Jan 09 '20 at 17:07
