Java regex doesn't match outside of the ASCII range, behaves differently than Python regex

Question

I want to filter strings from documents the same way sklearn's CountVectorizer does. It uses the following regex: (?u)\b\w\w+\b. This Java code should behave the same way:

Pattern regex = Pattern.compile("(?u)\\b\\w\\w+\\b");
Matcher matcher = regex.matcher("this is the document.!? äöa m²");

while(matcher.find()) {
    String match = matcher.group();
    System.out.println(match);
}

But this doesn't produce the desired output, which Python gives:

this
is
the
document
äöa
m²

It instead outputs:

this
is
the
document

What can I do to include non-ASCII characters, as the Python regex does?

@WiktorStribiżew that does work for `äöa` but doesn't work for `m²` — ctwheels, Mar 21 '18 at 14:34
Maybe this is useful? https://stackoverflow.com/questions/6381752/validating-users-utf-8-name-in-javascript — Lance Toth, Mar 21 '18 at 14:37
Thank you! This works for the German letters, but still doesn't include the squared sign (²), any idea how to fix that one? — Daniel Kirchner, Mar 21 '18 at 14:37
Do you want to make sure the Unicode `\w` in Java regex matches the same chars as Python's Unicode `\w`? — Wiktor Stribiżew, Mar 21 '18 at 14:39
Ok, same thing, but for Java https://stackoverflow.com/questions/10894122/java-regex-for-support-unicode — Lance Toth, Mar 21 '18 at 14:40
@WiktorStribiżew exactly, I have a pretty big document file I am testing this with; the only difference at this point (between Python and Java) is that Python picks up `²,½,³`. — Daniel Kirchner, Mar 21 '18 at 14:43
@LanceToth it is working for unicode now with `(?U)`, only missing exponentials and fractions now. — Daniel Kirchner, Mar 21 '18 at 14:45
To just support super/subscript numbers, you may extend the pattern to `"(?U)[\\w\\p{No}]{2,}"`. — Wiktor Stribiżew, Mar 21 '18 at 14:45
@WiktorStribiżew Thank you very much, this seems to be working! — Daniel Kirchner, Mar 21 '18 at 14:51
Possible duplicate of [Unicode equivalents for \w and \b in Java regular expressions?](https://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions) — Tamas Rev, Mar 21 '18 at 15:20

ctwheels · Accepted Answer · 2018-03-21T18:49:09.323

As suggested by Wiktor in the comments, you could use (?U) to turn on the UNICODE_CHARACTER_CLASS flag. While this does allow matching äöa, it still doesn't match m². That's because \w under UNICODE_CHARACTER_CLASS doesn't treat ² as a word character.

As a replacement for \w, you can use [\pN\pL_]. This matches Unicode numbers (\pN) and Unicode letters (\pL), plus _. The \pN class includes the \p{No} (Other Number) category, which contains the superscripts ²³¹ from the Latin-1 Supplement block. Alternatively, you could just add \p{No} to a character class alongside \w. Note that Java accepts the brace-less form only for one-letter names such as \pN; a two-letter category needs braces, so write \p{No} rather than \pNo (which would be read as \pN followed by a literal o). This means the following regular expressions correctly match your strings:

[\pN\pL_]{2,}         # Matches any Unicode number or letter, and underscore
(?U)[\w\p{No}]{2,}    # Uses UNICODE_CHARACTER_CLASS so that \w matches Unicode.
                      # Adds \p{No} to additionally match ²³¹
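As a quick check, both patterns can be run against the sample string from the question. This is only a minimal sketch: the class name `UnicodeTokenDemo` and the helper `findAll` are made up for illustration, and the non-ASCII characters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeTokenDemo {
    // Hypothetical helper: collects every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Same text as in the question: a-umlaut, o-umlaut, 'a', then 'm' + superscript two.
        String doc = "this is the document.!? \u00e4\u00f6a m\u00b2";

        // Variant 1: explicit Unicode letter and number classes plus underscore.
        System.out.println(findAll("[\\pN\\pL_]{2,}", doc));

        // Variant 2: Unicode-aware word class, extended with the Other Number category.
        System.out.println(findAll("(?U)[\\w\\p{No}]{2,}", doc));
    }
}
```

Both calls should yield the same six tokens that Python produces for this input. The two classes are not identical in general, though, so it is worth testing against your own documents.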

So why doesn't \w match ² in Java but it does in Python?

Java's interpretation

Looking at OpenJDK 8-b132's Pattern implementation, we get the following information (I removed information irrelevant to answering the question):

Unicode support

The following Predefined Character classes and POSIX character classes are in conformance with the recommendation of Annex C: Compatibility Properties of Unicode Regular Expression, when UNICODE_CHARACTER_CLASS flag is specified.

\w A word character: [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}]

Great! Now we have a definition for \w when the (?U) flag is used. Plugging these Unicode character classes into this amazing tool will tell you exactly what each of them matches. To keep this post short, I'll just tell you that none of the following classes matches ²:

\p{Alpha}
\p{gc=Mn}
\p{gc=Me}
\p{gc=Mc}
\p{Digit}
\p{gc=Pc}
\p{IsJoin_Control}
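If you would rather verify this locally than rely on an external tool, a small sketch like the following tests each class against ² directly. The class name `SuperscriptTwoCheck` and the helper `matches` are illustrative, and the code assumes Java 8 or later (where `\p{IsJoin_Control}` is available).

```java
import java.util.regex.Pattern;

public class SuperscriptTwoCheck {
    // The classes that make up the Unicode-aware word class, per the Javadoc quoted above.
    static final String[] WORD_PARTS = {
        "\\p{Alpha}", "\\p{gc=Mn}", "\\p{gc=Me}", "\\p{gc=Mc}",
        "\\p{Digit}", "\\p{gc=Pc}", "\\p{IsJoin_Control}"
    };

    // Hypothetical helper: true if the whole input matches the regex
    // with UNICODE_CHARACTER_CLASS switched on.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(input)
                      .matches();
    }

    public static void main(String[] args) {
        String superscriptTwo = "\u00b2";
        for (String part : WORD_PARTS) {
            System.out.println(part + " -> " + matches(part, superscriptTwo));
        }
        // For comparison: the combined word class and the Other Number category.
        System.out.println("\\w -> " + matches("\\w", superscriptTwo));
        System.out.println("\\p{No} -> " + matches("\\p{No}", superscriptTwo));
    }
}
```

Every entry in `WORD_PARTS`, and `\w` itself, should report false for ², while `\p{No}` should report true.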

Python's interpretation

So why does Python match ²³¹ when the u flag is used in conjunction with \w? This one was very difficult to track down, but I went digging into Python's source code (I used Python 3.6.5rc1 - 2018-03-13). After removing a lot of the fluff for how this gets called, basically the following happens:

\w is defined as CATEGORY_UNI_WORD, which is then prefixed with SRE_. SRE_CATEGORY_UNI_WORD calls SRE_UNI_IS_WORD(ch)
SRE_UNI_IS_WORD is defined as (SRE_UNI_IS_ALNUM(ch) || (ch) == '_').
SRE_UNI_IS_ALNUM calls Py_UNICODE_ISALNUM, which is, in turn, defined as (Py_UNICODE_ISALPHA(ch) || Py_UNICODE_ISDECIMAL(ch) || Py_UNICODE_ISDIGIT(ch) || Py_UNICODE_ISNUMERIC(ch)).
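The important ones here are Py_UNICODE_ISDIGIT(ch) and Py_UNICODE_ISNUMERIC(ch), since ² is a digit and a numeric character but not a decimal digit. The digit check follows the same pattern as Py_UNICODE_ISDECIMAL(ch), which is defined as _PyUnicode_IsDecimalDigit(ch).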

Now, let's take a look at the method _PyUnicode_IsDecimalDigit(ch):

int _PyUnicode_IsDecimalDigit(Py_UCS4 ch)
{
    if (_PyUnicode_ToDecimalDigit(ch) < 0)
        return 0;
    return 1;
}

As we can see, this method returns 0 if _PyUnicode_ToDecimalDigit(ch) < 0 and 1 otherwise. So what does _PyUnicode_ToDecimalDigit look like?

int _PyUnicode_ToDecimalDigit(Py_UCS4 ch)
{
    const _PyUnicode_TypeRecord *ctype = gettyperecord(ch);

    return (ctype->flags & DECIMAL_MASK) ? ctype->decimal : -1;
}

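So, if the type record that gettyperecord(ch) looks up for the character has the DECIMAL_MASK flag set, the character's decimal value (greater than or equal to 0) is returned; otherwise -1 is returned. The flag lives in the character's entry in Python's Unicode database, not in the bits of its code point.

For ² (U+00B2), the decimal flag is not set ('²'.isdecimal() is False in Python), but the corresponding digit flag is ('²'.isdigit() is True), so Py_UNICODE_ISDIGIT succeeds. Characters such as ½ pass only the numeric check. Either way Py_UNICODE_ISALNUM is true, so these characters are deemed valid Unicode alphanumeric characters in Python, and \w with the u flag matches ².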
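To make the difference concrete, Python's chain of checks can be roughly approximated in Java. This is a sketch under assumptions, not a port of CPython: the helper `isPythonLikeWordChar` is hypothetical, `Character.isLetter` stands in for Py_UNICODE_ISALPHA, and the three number categories (Nd, Nl, No) stand in for the decimal, digit and numeric checks.

```java
public class PythonLikeWordChar {
    // Hypothetical helper approximating Python's Unicode word-character test:
    // underscore, any letter, or any character in a Unicode number category.
    static boolean isPythonLikeWordChar(int codePoint) {
        if (codePoint == '_') {
            return true;
        }
        if (Character.isLetter(codePoint)) {
            return true; // stand-in for Py_UNICODE_ISALPHA
        }
        int type = Character.getType(codePoint);
        return type == Character.DECIMAL_DIGIT_NUMBER // Nd, e.g. '7'
            || type == Character.LETTER_NUMBER        // Nl, e.g. Roman numeral signs
            || type == Character.OTHER_NUMBER;        // No, e.g. superscript two, one half
    }

    public static void main(String[] args) {
        int[] samples = {0x00B2, 0x00BD, 0x00E4, '_', '!'};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %b%n", cp, isPythonLikeWordChar(cp));
        }
    }
}
```

Under this approximation ² (U+00B2) and ½ (U+00BD) count as word characters, which is in line with the [\pN\pL_] replacement class.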

score 0 · Answer 2 · answered Mar 21 '18 at 15:20

There is one more step left: you need to specify that \w includes Unicode characters too. Pattern.UNICODE_CHARACTER_CLASS to the rescue:

    Pattern regex = Pattern.compile("(?u)\\b\\w\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS);
                                                   // ^^^^^^^^^^
    Matcher matcher = regex.matcher("this is the document.!? äöa m²");

    while(matcher.find()) {
        String match = matcher.group();
        System.out.println(match);
    }
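To see what the flag changes on its own, the sketch below runs the same pattern with and without it. The class name `FlagComparison` and the helper `tokens` are illustrative only, and the (?u) prefix is left out because it is not needed for this comparison.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlagComparison {
    // Hypothetical helper: tokenizes the input with the word pattern and the given flags.
    static List<String> tokens(String input, int flags) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile("\\b\\w\\w+\\b", flags).matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Same text as in the question: a-umlaut, o-umlaut, 'a', then 'm' + superscript two.
        String doc = "this is the document.!? \u00e4\u00f6a m\u00b2";

        // Default behaviour: the word class is ASCII-only.
        System.out.println(tokens(doc, 0));

        // With the flag: the word class follows Unicode, so the umlaut word is found.
        System.out.println(tokens(doc, Pattern.UNICODE_CHARACTER_CLASS));
    }
}
```

With the flag, äöa is picked up in addition to the four ASCII words, but m² is still missing, which matches what was reported in the comments on the question.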
