
Generally, I always thought that in regular expressions \w is short for [A-Za-z0-9_], as per Wikipedia.

But I recently ran into an issue in C#/.NET where it matches something else. I was parsing some French text and discovered that \w matches ê (e-circumflex).

Strange, I thought; I didn't expect that. So I tested the same regex in a couple of other languages and noticed some inconsistencies.

Given the following code samples:

C#.NET (Specifically .NET 4.7.2 if that matters), .NET Fiddle here

var r = new Regex(@"\w");
Console.WriteLine(r.IsMatch("ê"));

output :

True

JavaScript (Chrome), JSBin here

var r = /\w/;
console.log(r.test("ê"));

// or, via the constructor (the backslash must be escaped inside a string literal)
var s = new RegExp('\\w');
console.log(s.test("ê"));

output:

false
false

PHP (v7.4.7), onlinephpfunctions here

$str = "ê";
$pattern = "/\w/";
echo preg_match($pattern, $str);

outputs

0

Perl (v5.24.2), link here

$str = "ê";
if ($str =~ m/\w/i) {
  print "Match found\n";
} else {
  print "No match found\n";
}

outputs

No match found

Python, repl.it here

import re
p = re.compile(r'\w')  # raw string, so the backslash reaches the regex engine unchanged
m = p.match("ê")
if m:
    print('Match found')

outputs

Match found

Is it just me, or does something not seem right? Does anyone know what's going on here, and why .NET and Python differ from PHP, JS and, the daddy of them all, Perl?

OJay

1 Answer

In .NET, as well as in XML Schema, Python 3 (but not Python 2), and ICU (used by Android and by R's stringr / stringi functions), \w is Unicode-aware by default.

It is not Unicode-aware by default in PCRE and Java, but you can turn it on with the right flag: /u in PCRE (for example, "/\w/u" in PHP) and (?U) / Pattern.UNICODE_CHARACTER_CLASS in Java.
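Python 3 works the other way around: \w is Unicode-aware for str patterns unless you switch it off. A minimal sketch of that behaviour, using only the standard re module (the variable names are just for illustration):

```python
import re

sample = "ê"  # U+00EA, LATIN SMALL LETTER E WITH CIRCUMFLEX

# Default for str patterns in Python 3: \w is Unicode-aware.
unicode_match = re.search(r"\w", sample) is not None

# re.ASCII (or the inline flag (?a)) restricts \w to [A-Za-z0-9_].
ascii_match = re.search(r"\w", sample, flags=re.ASCII) is not None
inline_ascii_match = re.search(r"(?a)\w", sample) is not None

print(unicode_match, ascii_match, inline_ascii_match)  # True False False
```

So the Python result in the question can be made to agree with the JS and PHP results by passing re.ASCII.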

See the Shorthand Character Classes reference:

\w stands for “word character”. It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits. In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren’t digits may or may not be included. XML Schema and XPath even include all symbols in \w. Again, Java, JavaScript, and PCRE match only ASCII characters with \w.

The Unicode-aware \w meanings:

  • .NET - [\p{L}\p{Nd}\p{Mn}\p{Pc}]
  • Python 3 - [\p{L}\p{Mn}\p{Nd}_] (Note: this is only an approximate pattern, and it can only be written with the PyPI regex module, since re does not support Unicode property classes. That is why it is so useful that \w is Unicode-aware in Python 3.)
  • Java - [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}]
  • ICU - [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\u200c\u200d]
  • XML Schema - [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}]
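Since Python's re has no \p{...} classes, one way to experiment with a category-based definition such as [\p{L}\p{Mn}\p{Nd}\p{Pc}] is to test the general category directly. The sketch below is only an illustration: is_word_char_by_category is a hypothetical helper, not part of any library, and the sample characters were picked so that the built-in \w happens to agree with it.

```python
import re
import unicodedata


def is_word_char_by_category(ch: str) -> bool:
    """Hypothetical helper: True if ch is in general category L*, Mn, Nd or Pc."""
    category = unicodedata.category(ch)
    return category.startswith("L") or category in {"Mn", "Nd", "Pc"}


samples = ["ê", "5", "_", "-", " "]
for ch in samples:
    builtin = re.fullmatch(r"\w", ch) is not None
    # Columns: character, general category, category-based test, built-in \w
    print(repr(ch), unicodedata.category(ch), is_word_char_by_category(ch), builtin)
```

The two columns are not guaranteed to agree on every code point (the Python pattern above is described as approximate), so treat the helper as a way to explore the definitions rather than as a replacement for \w.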

When \w is made Unicode-aware:

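In JavaScript, there is no way to make \w itself Unicode-aware, so use the explicit class [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}] instead. Note that \p{...} property escapes only work when the regex has the u flag.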

Wiktor Stribiżew