
In Python or PHP, a simple regex such as /\W/gu matches any non-word character in any script. In JavaScript, however, it matches only [^A-Za-z0-9_]. What are the correct ranges to match the same characters as Python and PHP do?

https://regex101.com/r/yhNF8U/1/

DannyM
  • @Mandy8055 I want to match anything but word characters, just like it works in PHP and in Python (if you click the regex101 link you can see how different languages match that regex) – DannyM Jul 07 '20 at 10:03
  • You could test the below character properties on [regexpal](https://www.regexpal.com/) –  Jul 07 '20 at 10:30

1 Answer


Generic solution

Mathias Bynens suggests following the UTS #18 recommendation, so a Unicode-aware \W will look like:

[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]

Please note the comment for the suggested Unicode property class combination:

This is only an approximation to Word Boundaries (see b below). The Connector Punctuation is added in for programming language identifiers, thus adding "_" and similar characters.

More considerations

The \w construct (and thus its \W counterpart), when matching in a Unicode-aware context, matches a similar but somewhat different set of characters in each regex engine.

For example, the .NET definition of the non-word character class \W is [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Mn}\p{Pc}\p{Lm}]. The five letter categories \p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm} together make up \p{L}, so the pattern is equal to [^\p{L}\p{Nd}\p{Mn}\p{Pc}].
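The contraction to \p{L} relies on the letter subcategories Ll, Lu, Lt, Lo and Lm (all of which appear in the .NET class) adding up to L. A quick sanity check in JavaScript, limited to the Basic Multilingual Plane to keep it fast; the variable names are only for illustration:

```javascript
// Compare \p{L} with the union of the five letter subcategories
// for every BMP code point.
const anyLetter = /^\p{L}$/u;
const fiveSubcategories = /^[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm}]$/u;

let letterMismatches = 0;
for (let cp = 0; cp <= 0xffff; cp++) {
  if (cp >= 0xd800 && cp <= 0xdfff) continue; // skip lone surrogates
  const ch = String.fromCodePoint(cp);
  if (anyLetter.test(ch) !== fiveSubcategories.test(ch)) letterMismatches++;
}

console.log(letterMismatches); // 0

// Modifier letters (Lm) are part of \p{L} too, e.g. U+02B0.
console.log(/\p{Lm}/u.test("\u02B0")); // true
```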

In Android (see documentation), \W is [^\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}], where \p{gc=Mn}\p{gc=Me}\p{gc=Mc} can simply be written as \p{M}.

In PHP PCRE, \W matches [^\p{L}\p{N}_].

Rexegg cheat sheet defines Python 3 \w as "Unicode letter, ideogram, digit, or underscore", i.e. [\p{L}\p{Mn}\p{Nd}_].

You may roughly decompose \W as [^\p{L}\p{N}\p{M}\p{Pc}]:

/[^\p{L}\p{N}\p{M}\p{Pc}]/gu

where

  • [^ - is the start of the negated character class that matches a single char other than:
    • \p{L} - any Unicode letter
    • \p{N} - any Unicode number (decimal digits, letter numbers and other numeric characters)
    • \p{M} - any combining (diacritic) mark
    • \p{Pc} - a connector punctuation symbol
  • ] - end of the character class.

Note that it is the \p{Pc} class that matches the underscore.
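To see the difference from the built-in class, here is a small comparison between JavaScript's own \W and the decomposed class. The sample strings and variable names are assumptions made for this sketch:

```javascript
const asciiNonWord = /\W/gu;                         // built-in: [^A-Za-z0-9_]
const unicodeNonWord = /[^\p{L}\p{N}\p{M}\p{Pc}]/gu; // decomposed class from above

const greeting = "привет_мир 2020";

// The built-in \W treats every Cyrillic letter as a non-word character.
const asciiResult = greeting.replace(asciiNonWord, "");
console.log(asciiResult); // "_2020"

// The Unicode-aware class only removes the space.
const unicodeResult = greeting.replace(unicodeNonWord, "");
console.log(unicodeResult); // "привет_мир2020"

// A base letter followed by a combining acute accent (U+0301) is kept
// whole because \p{M} covers the combining mark.
const decomposed = "e\u0301";
console.log(decomposed.replace(unicodeNonWord, "").length); // 2
```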

Note that \p{Alphabetic} (\p{Alpha}) includes all letters matched by \p{L}, plus letter numbers matched by \p{Nl} (e.g. Ⅻ, the character for the Roman numeral 12), plus some other symbols matched by \p{Other_Alphabetic} (\p{OAlpha}).
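A short check of this in JavaScript, using two code points picked for illustration: U+216B (ROMAN NUMERAL TWELVE, category Nl) and U+093F (DEVANAGARI VOWEL SIGN I, a spacing mark that also carries the Other_Alphabetic property):

```javascript
const romanTwelve = "\u216B"; // ROMAN NUMERAL TWELVE, category Nl
const vowelSignI = "\u093F";  // DEVANAGARI VOWEL SIGN I, category Mc

const alphabeticChecks = {
  romanIsLetter: /\p{L}/u.test(romanTwelve),              // false
  romanIsLetterNumber: /\p{Nl}/u.test(romanTwelve),       // true
  romanIsAlphabetic: /\p{Alphabetic}/u.test(romanTwelve), // true
  vowelSignIsLetter: /\p{L}/u.test(vowelSignI),           // false
  vowelSignIsMark: /\p{M}/u.test(vowelSignI),             // true
  vowelSignIsAlphabetic: /\p{Alphabetic}/u.test(vowelSignI), // true
};

console.log(alphabeticChecks);
```

Both characters are outside \p{L} but inside \p{Alphabetic}, which is why the two classes are not interchangeable.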

Other variations:

  • /[^\p{L}0-9_]/gu - a \W that is aware of Unicode letters only (digits stay ASCII)
  • /[^\p{L}\p{N}_]/gu - (PCRE \W style) a \W that is aware of Unicode letters and numbers only.
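The practical difference between these variations and the class with \p{M} and \p{Pc} shows up with non-ASCII digits and combining marks. A minimal sketch, where the helper name strip and the sample strings are made up for illustration:

```javascript
const lettersOnly = /[^\p{L}0-9_]/gu;           // Unicode letters, ASCII digits
const pcreStyle = /[^\p{L}\p{N}_]/gu;           // Unicode letters and numbers
const withMarks = /[^\p{L}\p{N}\p{M}\p{Pc}]/gu; // also marks and connectors

// Hypothetical helper: delete everything the given class matches.
const strip = (re, s) => s.replace(re, "");

const arabicDigits = "\u0664\u0662";                  // ARABIC-INDIC DIGITS 4 and 2
const hindi = "\u0939\u093F\u0928\u094D\u0926\u0940"; // the word "Hindi" in Devanagari

// Arabic-Indic digits are only kept once \p{N} is part of the class.
console.log(strip(lettersOnly, arabicDigits) === "");         // true
console.log(strip(pcreStyle, arabicDigits) === arabicDigits); // true

// Without \p{M}, the vowel signs and the virama are stripped,
// leaving only the three base consonants.
console.log(strip(pcreStyle, hindi) === "\u0939\u0928\u0926"); // true
console.log(strip(withMarks, hindi) === hindi);                // true
```

This is the reason for preferring the whole mark class when the text may contain Indic scripts or other languages that rely on combining marks.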

Note that Java's (?U)\W will match a mix of what \W matches in PCRE, Python and .NET.

Wiktor Stribiżew
  • I don't know why this answer was downvoted; this doesn't work on regex101, however it seems to work in my code. I'll test it a little more in my code and if it works with every reasonable input I'll accept it – DannyM Jul 07 '20 at 10:06
  • @thelmuxkriovar regex101 does not support Unicode property classes in the JS regex flavor. It is a bug that is specific to regex101. – Wiktor Stribiżew Jul 07 '20 at 10:07
  • @thelmuxkriovar please check [this](https://github.com/firasdib/Regex101/issues/1333) issue. –  Jul 07 '20 at 10:10
  • Well explained and as always very educational ++ – The fourth bird Jul 07 '20 at 10:42
  • @Thefourthbird Unfortunately, it is pretty concise. This topic deserves a whole book chapter. I made tests before, to see what Unicode-aware `\w` matches in different engines, but I lost most of the details. The main idea though is to design your own character class that only contains the Unicode property classes you need. – Wiktor Stribiżew Jul 07 '20 at 10:45
  • Your regex reads `\p{M}`, but you detail `\p{Mn}`, is that a typo? – sp00m Aug 18 '20 at 11:41
  • @sp00m Thank you for spotting the inconsistency, I meant to say `\p{M}`. I noticed it is best to use the whole mark class, especially when dealing with languages like Hebrew, or Indic languages. – Wiktor Stribiżew Aug 18 '20 at 11:43