I stumbled across this regular expression in C# that I would like to port to JavaScript, and I do not understand the following:

[-.\p{Lu}\p{Ll}0-9]+

The part I have a hard time with is, of course, \p{Lu}. None of the regex websites I visited mention this modifier.

Any idea?

Alexei Levenkov
Mikaël Mayer
  • 4
    see the description at the right side in this link http://regex101.com/r/lG2nG9/1 – Avinash Raj Sep 22 '14 at 15:06
  • 1
    http://www.regular-expressions.info/unicode.html#category – Smern Sep 22 '14 at 15:06
  • 1
    Always one more website! Thank you for regex101, which is very cool; I had never seen this website before. Post it as an answer? Or I'll delete the question if it is too obvious (it was not for me). – Mikaël Mayer Sep 22 '14 at 15:08
  • You could use `\p{L}` instead of `\p{Lu}\p{Ll}` – Toto Sep 22 '14 at 15:18
  • For C#/.Net regular expression syntax consider visiting MSDN - [Regular Expression Language](http://msdn.microsoft.com/en-us/library/az24scfc%28v=vs.110%29.aspx) and subsequent [Character classes](http://msdn.microsoft.com/en-us/library/20bw873z%28v=vs.110%29.aspx). – Alexei Levenkov Sep 22 '14 at 15:22

1 Answer

These are Unicode properties.

The Unicode property \p{L} (shorthand for \p{Letter}) matches any kind of letter from any language. Accordingly, \p{Lu} matches an uppercase letter that has a lowercase variant, and its counterpart \p{Ll} matches a lowercase letter that has an uppercase variant.

In short, the two properties together match any uppercase or lowercase letter that has a case variant, in any language. For the basic Latin alphabet, that is:

AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz
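
Since the question is about porting the pattern to JavaScript, here is a minimal sketch of one way to do it. It assumes an engine with ES2018 Unicode property escapes, which require the `u` flag. The names `tokenPattern` and `isAllowedToken` are hypothetical and not from the original C# code.

```javascript
// Possible port of the C# pattern [-.\p{Lu}\p{Ll}0-9]+ to JavaScript.
// The `u` flag is required: without it, \p{...} is not a property escape.
const tokenPattern = /[-.\p{Lu}\p{Ll}0-9]+/u;

// Hypothetical helper: true when the whole string is made only of
// hyphens, dots, uppercase/lowercase letters, and ASCII digits.
function isAllowedToken(text) {
  return /^[-.\p{Lu}\p{Ll}0-9]+$/u.test(text);
}

// "\u00C9" is the precomposed uppercase E with acute accent (category Lu).
console.log(isAllowedToken("\u00C9cole-2.0")); // true
console.log(isAllowedToken("hello world"));    // false (space is not allowed)
```

Engines without property escape support would need an explicit list of character ranges or a library instead.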
hwnd
  • 69,796
  • 4
  • 95
  • 132
  • 3
    Could you explain `uppercase letter that has a lowercase variant`? Mainly the `lowercase variant`. – Avinash Raj Sep 22 '14 at 15:16
  • 2
    @AvinashRaj It means that a Unicode character can exist in both uppercase and lowercase forms, and that the pattern matches only the uppercase form of that letter. It also implies that there are uppercase characters that have no lowercase version. – Reactgular Sep 22 '14 at 15:24
  • So then would `\p{L}` potentially match some characters that the given regex wouldn't? Namely those that don't have an uppercase or lowercase variant? – Brian Reischl Sep 22 '14 at 15:27
  • Think about the lowercase German character `ß`. Since this letter cannot occur at the beginning of a word, there is never going to be an uppercase variant for it. – OnlineCop Nov 14 '15 at 08:23
  • 1
    @OnlineCop Well, good thing if you used `\p{L}`, instead of maintaining a hardcoded list yourself, as there is now an uppercase `ẞ`. [Wikipedia](https://en.wikipedia.org/wiki/Capital_%C3%9F) has details. – luckydonald May 19 '20 at 13:34
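
To make the difference raised in these comments concrete, the sketch below compares `\p{L}` with `[\p{Lu}\p{Ll}]` on a few characters. It assumes a JavaScript engine with ES2018 Unicode property escapes. The `classify` helper is hypothetical, and the samples are picked for illustration only.

```javascript
// Compare "uppercase or lowercase letter" against "any letter".
const upperOrLower = /^[\p{Lu}\p{Ll}]$/u;
const anyLetter = /^\p{L}$/u;

// Hypothetical helper: reports which of the two classes a single character is in.
function classify(ch) {
  return { upperOrLower: upperOrLower.test(ch), anyLetter: anyLetter.test(ch) };
}

console.log(classify("A"));      // { upperOrLower: true,  anyLetter: true }  (Lu)
console.log(classify("\u00DF")); // { upperOrLower: true,  anyLetter: true }  (ß, Ll)
console.log(classify("\u1E9E")); // { upperOrLower: true,  anyLetter: true }  (ẞ, Lu)
console.log(classify("\u4E2D")); // { upperOrLower: false, anyLetter: true }  (中, Lo)
console.log(classify("\u01C5")); // { upperOrLower: false, anyLetter: true }  (ǅ, Lt)
console.log(classify("5"));      // { upperOrLower: false, anyLetter: false } (Nd)
```

So `\p{L}` is a superset: it also covers titlecase, modifier, and other letters (categories Lt, Lm, and Lo), which `\p{Lu}\p{Ll}` does not match.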