In Java, a symbol is \pS, which is not the same as punctuation characters, which are \pP.
I talk about this issue, plus enumerate the types for all the ASCII punctuation and symbols, here in this answer.
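To see the split on a few ASCII characters, here is a minimal sketch. The class name SymbolVsPunctuation and the classify helper are made up for illustration; only the \pS and \pP properties come from the text above.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class SymbolVsPunctuation {
    // \pS matches any Unicode symbol (Sm, Sc, Sk, So).
    private static final Pattern SYMBOL = Pattern.compile("\\pS");
    // \pP matches any Unicode punctuation (Pc, Pd, Ps, Pe, Pi, Pf, Po).
    private static final Pattern PUNCTUATION = Pattern.compile("\\pP");

    // Returns "symbol", "punctuation", or "other" for a single code point.
    static String classify(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        if (SYMBOL.matcher(s).matches()) {
            return "symbol";
        }
        if (PUNCTUATION.matcher(s).matches()) {
            return "punctuation";
        }
        return "other";
    }

    public static void main(String[] args) {
        // A handful of ASCII characters that people often lump together.
        for (char c : "$+<=>^`|~!#%&*-_".toCharArray()) {
            System.out.println(c + " -> " + classify(c));
        }
    }
}
```

Running it should report characters such as $ and + as symbols, while ! and - come back as punctuation, so a class built on \pP alone will miss the former.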
Patterns like [\p{Alnum}\s] only work on legacy data sets from the 1960s. To work on things with Java's native character set, you need something on the order of
String identifier_charclass = "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";
String whitespace_charclass = "[\\u000A\\u000B\\u000C\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E\\u2000\\u2001\\u2002\\u2003\\u2004\\u2005\\u2006\\u2007\\u2008\\u2009\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000]";
String ident_or_white = "[" + identifier_charclass + whitespace_charclass + "]";
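As a rough check of the difference, the sketch below compiles both the ASCII-only class and the combined class and tries them on a string with accented letters. The class name UnicodeCharClasses and the two helper methods are hypothetical; the character class strings are the ones given above, and the ASCII-only behavior assumes the default flags (no UNICODE_CHARACTER_CLASS).

```java
import java.util.regex.Pattern;

// Hypothetical wrapper class, for illustration only.
class UnicodeCharClasses {
    static final String IDENTIFIER_CHARCLASS =
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";
    static final String WHITESPACE_CHARCLASS =
        "[\\u000A\\u000B\\u000C\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E"
        + "\\u2000\\u2001\\u2002\\u2003\\u2004\\u2005\\u2006\\u2007\\u2008"
        + "\\u2009\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000]";
    // Nesting bracketed classes inside an outer class forms their union in Java.
    static final String IDENT_OR_WHITE =
        "[" + IDENTIFIER_CHARCLASS + WHITESPACE_CHARCLASS + "]";

    // With default flags, \p{Alnum} and \s only cover ASCII.
    private static final Pattern LEGACY = Pattern.compile("[\\p{Alnum}\\s]+");
    private static final Pattern UNICODE = Pattern.compile(IDENT_OR_WHITE + "+");

    // True if the whole string consists of ASCII alphanumerics and ASCII whitespace.
    static boolean matchesLegacy(String s) {
        return LEGACY.matcher(s).matches();
    }

    // True if the whole string consists of identifier or whitespace characters.
    static boolean matchesUnicode(String s) {
        return UNICODE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String text = "na\u00EFve caf\u00E9"; // "naive cafe" with accented letters
        System.out.println("legacy:  " + matchesLegacy(text));
        System.out.println("unicode: " + matchesUnicode(text));
    }
}
```

The legacy pattern rejects the accented string, while the combined class accepts it. Passing Pattern.UNICODE_CHARACTER_CLASS when compiling is another route on Java 7 and later, but it changes what the predefined classes mean, so it is not a drop-in substitute for the explicit classes here.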
I’m sorry that Java makes it so difficult to work with modern data sets, but at least it is possible.
Just don’t ask about boundaries or grapheme clusters. For that, see my other postings.