As suggested by Wiktor in the comments, you could use (?U)
to turn on the flag UNICODE_CHARACTER_CLASS
. While this does allow matching äöa
, this still doesn't match m²
. That's because UNICODE_CHARACTER_CLASS
with \w
doesn't recognize ²
as a valid alphanumeric character. As a replacement for \w
, you can use [\pN\pL_]
. This matches Unicode numbers \pN
and Unicode letters \pL
(plus _
). The \pN Unicode character class includes the \p{No} (Other Number) category, which contains the superscript digits from the Latin-1 Supplement block (²³¹). Alternatively, you could just add \p{No} to a character class with \w (the braces matter: Java reads the brace-less \pNo as \pN followed by a literal o)
. This means the following regular expressions correctly match your strings:
[\pN\pL_]{2,} # Matches any Unicode number or letter, and underscore
(?U)[\w\p{No}]{2,} # Uses UNICODE_CHARACTER_CLASS so that \w matches Unicode.
# Adds \p{No} to additionally match ²³¹
So why does \w match ² in Python but not in Java?
Java's interpretation
Looking at OpenJDK 8-b132's Pattern
implementation, we get the following information (I removed information irrelevant to answering the question):
Unicode support
The following Predefined Character classes and POSIX character
classes are in conformance with the recommendation of Annex C:
Compatibility Properties of Unicode Regular Expression, when
UNICODE_CHARACTER_CLASS
flag is specified.
\w
A word character: [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}]
Great! Now we have a definition for \w
when the (?U)
flag is used. Plugging these Unicode character classes into this amazing tool will tell you exactly what each of them matches. To keep this post short, I'll just tell you that none of the following classes matches ²
:
\p{Alpha}
\p{gc=Mn}
\p{gc=Me}
\p{gc=Mc}
\p{Digit}
\p{gc=Pc}
\p{IsJoin_Control}
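As a quick cross-check that does not depend on the linked tool, Python's standard unicodedata module can report the general category of ². This is only a sketch: it reads Python's copy of the Unicode database rather than Java's, and using str.isalpha() as a stand-in for \p{Alpha} is an assumption (Java's property is the broader Alphabetic property).

```python
import unicodedata

ch = "\u00b2"  # SUPERSCRIPT TWO

# General category of the character; "No" means Other Number.
category = unicodedata.category(ch)

# Categories named in Java's (?U) definition of \w:
# Mn, Me, Mc (marks), Pc (connector punctuation),
# and Nd (decimal digit), which is what \p{Digit} means under (?U).
listed_categories = {"Mn", "Me", "Mc", "Nd", "Pc"}

# The join controls are only U+200C and U+200D.
join_controls = {"\u200c", "\u200d"}

in_java_word_class = (
    ch.isalpha()  # rough stand-in for \p{Alpha}; an assumption, see above
    or category in listed_categories
    or ch in join_controls
)

print(category)            # general category of the character
print(in_java_word_class)  # whether any listed class covers it
```

The same check can be repeated for ³ (U+00B3) and ¹ (U+00B9) by changing ch.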
Python's interpretation
So why does Python match ²³¹
when the u
flag is used in conjunction with \w
? This one was very difficult to track down, but I went digging into Python's source code (I used Python 3.6.5rc1 - 2018-03-13). After removing a lot of the fluff for how this gets called, basically the following happens:
- \w is defined as CATEGORY_UNI_WORD, which is then prefixed with SRE_. SRE_CATEGORY_UNI_WORD calls SRE_UNI_IS_WORD(ch).
- SRE_UNI_IS_WORD is defined as (SRE_UNI_IS_ALNUM(ch) || (ch) == '_').
- SRE_UNI_IS_ALNUM calls Py_UNICODE_ISALNUM, which is, in turn, defined as (Py_UNICODE_ISALPHA(ch) || Py_UNICODE_ISDECIMAL(ch) || Py_UNICODE_ISDIGIT(ch) || Py_UNICODE_ISNUMERIC(ch)).
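- The check traced here is Py_UNICODE_ISDECIMAL(ch), defined as _PyUnicode_IsDecimalDigit(ch). Py_UNICODE_ISDIGIT(ch) and Py_UNICODE_ISNUMERIC(ch) are defined the same way, as _PyUnicode_IsDigit(ch) and _PyUnicode_IsNumeric(ch).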
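The macro chain above can be mirrored in plain Python, because str.isalpha(), str.isdecimal(), str.isdigit() and str.isnumeric() are backed by the same C predicates. The function name sre_uni_is_word is made up for illustration; this is a sketch of the logic, not CPython's actual code.

```python
import re


def sre_uni_is_word(ch: str) -> bool:
    """Hypothetical Python mirror of SRE_UNI_IS_WORD for a single character."""
    is_alnum = (
        ch.isalpha()       # Py_UNICODE_ISALPHA
        or ch.isdecimal()  # Py_UNICODE_ISDECIMAL
        or ch.isdigit()    # Py_UNICODE_ISDIGIT
        or ch.isnumeric()  # Py_UNICODE_ISNUMERIC
    )
    return is_alnum or ch == "_"


for ch in "\u00b2\u00b3\u00b9":  # ² ³ ¹
    # Print each predicate separately to see which ones accept the character.
    print(ch, ch.isalpha(), ch.isdecimal(), ch.isdigit(), ch.isnumeric())
    # The mirror should agree with what the re module does for \w.
    assert sre_uni_is_word(ch) == bool(re.fullmatch(r"\w", ch))
```

Printing the four predicates separately shows which branch of the OR accepts each character.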
Now, let's take a look at the function _PyUnicode_IsDecimalDigit(ch)
:
int _PyUnicode_IsDecimalDigit(Py_UCS4 ch)
{
    if (_PyUnicode_ToDecimalDigit(ch) < 0)
        return 0;
    return 1;
}
As we can see, this function returns 0 if _PyUnicode_ToDecimalDigit(ch) < 0 and 1 otherwise
. So what does _PyUnicode_ToDecimalDigit
look like?
int _PyUnicode_ToDecimalDigit(Py_UCS4 ch)
{
    const _PyUnicode_TypeRecord *ctype = gettyperecord(ch);
    return (ctype->flags & DECIMAL_MASK) ? ctype->decimal : -1;
}
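To make the lookup concrete, here is a small Python model of the two C functions. It is a sketch under assumptions: the TypeRecord class and get_type_record helper are invented for illustration, the records are filled from Python's unicodedata module instead of CPython's generated tables, and only the decimal and digit flags are modelled (DIGIT_MASK = 0x04 mirrors the constant in CPython's unicodectype.c).

```python
import unicodedata
from dataclasses import dataclass

DECIMAL_MASK = 0x02
DIGIT_MASK = 0x04


@dataclass
class TypeRecord:
    """Hypothetical, reduced stand-in for _PyUnicode_TypeRecord."""
    flags: int
    decimal: int
    digit: int


def get_type_record(ch: str) -> TypeRecord:
    """Build a record from the Unicode database (stand-in for gettyperecord)."""
    decimal = unicodedata.decimal(ch, None)
    digit = unicodedata.digit(ch, None)
    flags = 0
    if decimal is not None:
        flags |= DECIMAL_MASK
    if digit is not None:
        flags |= DIGIT_MASK
    return TypeRecord(flags, decimal or 0, digit or 0)


def to_decimal_digit(ch: str) -> int:
    """Mirror of _PyUnicode_ToDecimalDigit: decimal value, or -1 if none."""
    record = get_type_record(ch)
    return record.decimal if record.flags & DECIMAL_MASK else -1


def is_decimal_digit(ch: str) -> bool:
    """Mirror of _PyUnicode_IsDecimalDigit."""
    return to_decimal_digit(ch) >= 0


for ch in ("7", "\u00b2"):
    record = get_type_record(ch)
    print(ch, hex(record.flags), to_decimal_digit(ch), is_decimal_digit(ch))
```

The point of the model is that the result depends only on the flags stored in the character's record, never on the numeric value of the code point.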
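So if the character's type record has the DECIMAL_MASK flag set, the character's decimal value (which is greater than or equal to 0) is returned; otherwise the function returns -1.
Note that the flags come from the record that gettyperecord(ch) looks up in CPython's generated Unicode tables, not from the bits of the code point itself, so ANDing the code point of ² (0x000000b2) with DECIMAL_MASK (0x02) says nothing about whether the flag is set. In fact, ² does not carry DECIMAL_MASK: the Unicode Character Database gives it no decimal value, and '²'.isdecimal() returns False in Python.
What ² does have is a digit value of 2, so the sibling check Py_UNICODE_ISDIGIT(ch), which works the same way with the digit flag, succeeds (as does Py_UNICODE_ISNUMERIC(ch)). Since Py_UNICODE_ISALNUM ORs these checks together, ² is deemed to be a valid Unicode alphanumeric character in Python, and thus \w with the u flag matches ².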