Regex \w matches ê

Question

Generally I always thought that in regular expressions \w is short for [A-Za-z0-9_], as per Wikipedia.

But recently I had an issue, in C#.NET, that it matches something else. I was parsing some French, and discovered that \w matches ê (e-circumflex).

Strange, I thought; I didn't expect that. So I tested the same regex in a couple of other languages and noticed some inconsistencies.

Given the following code samples:

C#.NET (Specifically .NET 4.7.2 if that matters), .NET Fiddle here

var r = new Regex(@"\w");
Console.WriteLine(r.IsMatch("ê"));

output:

True

Javascript (Chrome), JSBin here

var r = /\w/;
console.log(r.test("ê"));

//or 
var s = new RegExp('\\w'); // the backslash must be escaped inside a string literal
console.log(s.test("ê"));

output:

false
false

PHP (v7.4.7), onlinephpfunctions here

$str = "ê";
$pattern = "/\w/";
echo preg_match($pattern, $str);

outputs

0

Perl (v5.24.2), link here

$str = "ê";
if ($str =~ m/\w/i) {
  print "Match found\n";
} else {
  print "No match found\n";
}

outputs

No match found

Python, repl.it here

import re
p = re.compile(r'\w')  # raw string so the backslash reaches the regex engine
m = p.match("ê")
if m:
    print('Match found')

outputs

Match found

Is it just me, or does something not seem right? Does anyone know what's going on here? Why do .NET and Python differ from PHP, JS and, the daddy of them all, Perl?

Because the regex libraries differ. No need to be surprised. In Python and .NET, it is Unicode-aware by default. — Wiktor Stribiżew, Nov 11 '20 at 20:53
`\w` means alphanumeric. In some regexp libraries that means alphabetic in any language; others treat it as just ASCII. — Barmar, Nov 11 '20 at 20:55
Well wadda ya know, though I must not be seeing the wood for the trees, cheers — OJay, Nov 11 '20 at 21:00
Oh, it is also Unicode-aware by default in XML Schema (XSD), Android and ICU regex. — Wiktor Stribiżew, Nov 11 '20 at 21:01

score 1 · Accepted Answer · answered Nov 11 '20 at 21:36

In .NET (as well as XML Schema, Python 3 (not Python 2), and ICU (Android, R stringr / stringi functions)), \w is Unicode-aware by default.

It is not Unicode-aware by default in PCRE and Java, but you can turn it on with the right flag: /u in PHP (or (*UCP) in PCRE itself) and (?U) / Pattern.UNICODE_CHARACTER_CLASS in Java.
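Python 3 works the other way round: \w is Unicode-aware unless you switch that off. The following is a minimal sketch of that toggle using the standard re module; the matches helper and the variable names are made up for illustration.

```python
import re

E_CIRCUMFLEX = "\u00ea"  # "ê" as a single precomposed code point

# str patterns are Unicode-aware by default in Python 3.
unicode_word = re.compile(r"\w")

# re.ASCII (or the inline flag (?a)) restricts \w to [A-Za-z0-9_].
ascii_word = re.compile(r"\w", re.ASCII)
ascii_word_inline = re.compile(r"(?a)\w")


def matches(pattern, text):
    """Illustrative helper: True if the whole text matches the pattern."""
    return pattern.fullmatch(text) is not None


print(matches(unicode_word, E_CIRCUMFLEX))       # True
print(matches(ascii_word, E_CIRCUMFLEX))         # False
print(matches(ascii_word_inline, E_CIRCUMFLEX))  # False
print(matches(ascii_word, "e"))                  # True
```

With re.ASCII, Python behaves like the JavaScript and PHP samples in the question.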

See the Shorthand Character Classes reference:

\w stands for “word character”. It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits. In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren’t digits may or may not be included. XML Schema and XPath even include all symbols in \w. Again, Java, JavaScript, and PCRE match only ASCII characters with \w.

The Unicode-aware \w meanings:

c# - [\p{L}\p{Nd}\p{Mn}\p{Pc}] (source)
python - [\p{L}\p{Mn}\p{Nd}_] (source) (Note: this is an approximate pattern that can only be used with the PyPI regex module, since re does not support Unicode property classes, so it's really great that \w is Unicode-aware in Python 3)
android - [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}] (source)
icu - [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\u200c\u200d] (source)
xsd - [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}] (source)

When \w is made Unicode-aware:

pcre - (With /u in PHP or (*UCP) / (*UTF)(*UCP)) - [\p{L}\p{N}_] ("\w any character that matches \p{L} or \p{N}, plus underscore")
java - (With (?U) or Pattern.UNICODE_CHARACTER_CLASS) - [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}] (same as Android, source)
perl - (make Perl treat the file as Unicode, see Does \w match all alphanumeric characters defined in the Unicode standard?) - [\p{Alphabetic}\p{GC=Mark}\p{GC=Connector_Punctuation}\p{GC=Decimal_Number}]
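
These classes are built from Unicode general categories, so you can check them without a regex engine. The sketch below uses Python's standard unicodedata module to implement the PCRE wording quoted above ("\p{L} or \p{N}, plus underscore") and compares it with Python's built-in \w. The function name is_word_char_by_category and the sample characters are assumptions chosen for illustration.

```python
import re
import unicodedata


def is_word_char_by_category(ch: str) -> bool:
    # Hypothetical helper: general categories starting with "L" are
    # letters and those starting with "N" are numbers; the underscore
    # is added explicitly.
    return ch == "_" or unicodedata.category(ch)[0] in ("L", "N")


# ê, underscore, ASCII digit, Arabic-Indic digit three, hyphen, space, euro sign
samples = ["\u00ea", "_", "7", "\u0663", "-", " ", "\u20ac"]

for ch in samples:
    builtin = re.fullmatch(r"\w", ch) is not None
    print(
        f"U+{ord(ch):04X} {unicodedata.category(ch)} "
        f"category-based={is_word_char_by_category(ch)} builtin={builtin}"
    )
```

For these samples the two checks agree. Expect them to differ on other code points, such as combining marks, because the flavors listed above do not all include the same categories.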

In JavaScript, there is no way to make \w Unicode-aware, so use [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}] together with the u flag, which is required for \p{...} property escapes.
