Generally, I always thought that in regular expressions \w is short for [A-Za-z0-9_], as per Wikipedia.
But recently I ran into a case in C# (.NET) where it matches more than that. I was parsing some French text and discovered that \w matches ê (e-circumflex).
Strange, I thought; I didn't expect that. So I tested the same regex in a few other languages and noticed some inconsistencies.
Given the following code samples:
C#.NET (Specifically .NET 4.7.2 if that matters), .NET Fiddle here
var r = new Regex(@"\w");
Console.WriteLine(r.IsMatch("ê"));
output :
True
Javascript (Chrome), JSBin here
var r = /\w/;
console.log(r.test("ê"));
//or
var s = new RegExp('\\w'); // the backslash must be escaped inside a string literal, otherwise the pattern is just "w"
console.log(s.test("ê"));
output:
false
false
PHP (v7.4.7), onlinephpfunctions here
$str = "ê";
$pattern = "/\w/";
echo preg_match($pattern, $str);
outputs
0
Perl (v5.24.2), link here
$str = "ê";
if ($str =~ m/\w/i) {
    print "Match found\n";
} else {
    print "No match found\n";
}
outputs
No match found
Python, repl.it here
import re
p = re.compile(r'\w')  # raw string so the backslash reaches the regex engine unchanged
m = p.match("ê")
if m:
    print('Match found')
outputs
Match found
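To narrow down the Python case, here is a minimal sketch, assuming Python 3, where (as far as I can tell from the re module docs) \w is Unicode-aware for str patterns unless the re.ASCII flag is passed. The variable names are only for illustration, and ê is written as "\u00ea" so that the precomposed character is what gets tested.

```python
import re

# Precomposed e-circumflex (U+00EA), written as an escape to avoid
# any ambiguity about how the source file is encoded.
text = "\u00ea"

# Default behaviour for str patterns in Python 3: \w is Unicode-aware.
unicode_match = re.search(r"\w", text) is not None

# With re.ASCII, \w is restricted to [a-zA-Z0-9_].
ascii_match = re.search(r"\w", text, flags=re.ASCII) is not None

# The explicit character class I originally assumed \w stood for.
explicit_match = re.search(r"[A-Za-z0-9_]", text) is not None

print(unicode_match, ascii_match, explicit_match)  # True False False
```

So at least in Python the result seems to come from a Unicode-aware default rather than from a different definition of \w; with re.ASCII it behaves like the character class I expected.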
Is it just me, or does something not seem right? Does anyone know what's going on here? Why do .NET and Python behave differently from PHP, JavaScript and, the daddy of them all, Perl?