I need to create a Pattern that will match all Unicode digits and alphabetic characters. So far I have "\\p{IsAlphabetic}|[0-9]".

The first part works well: it correctly identifies non-Latin characters as alphabetic. The problem is the second half, which obviously matches only the ASCII digits 0-9. The character classes \d and \p{Digit} are also just [0-9] by default. The Javadoc for Pattern does not seem to mention a character class for Unicode digits. Does anyone have a good solution for this problem?

For my purposes, I would accept a way to match the set of all characters for which Character.isDigit returns true.

Tunaki
Aurand

2 Answers

Quoting the Java docs about isDigit:

A character is a digit if its general category type, provided by getType(codePoint), is DECIMAL_DIGIT_NUMBER.

So, I believe the pattern to match digits should be \p{Nd}.

Here's a working example at ideone. As you can see, the results are consistent between Pattern.matches and Character.isDigit.
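To make the difference between the number categories concrete, here is a minimal sketch. The class name CategoryCheck, the helper names, and the sample characters are my own choices for illustration and are not taken from the linked ideone example. It compares \p{Nd}, the broader \p{N}, and Character.isDigit on a Thai digit (category Nd), a Roman numeral (category Nl), and a superscript two (category No):

```java
public class CategoryCheck {
    // Decimal digits only (general category Nd), the set Character.isDigit accepts.
    static boolean isNd(String s) {
        return s.matches("\\p{Nd}+");
    }

    // Any number category: Nd, Nl (letter numbers) and No (other numbers).
    static boolean isN(String s) {
        return s.matches("\\p{N}+");
    }

    public static void main(String[] args) {
        String[] samples = {
            "\u0E53", // THAI DIGIT THREE, category Nd
            "\u2167", // ROMAN NUMERAL EIGHT, category Nl
            "\u00B2"  // SUPERSCRIPT TWO, category No
        };
        for (String s : samples) {
            int cp = s.codePointAt(0);
            System.out.println(String.format("U+%04X", cp)
                    + " Nd=" + isNd(s)
                    + " N=" + isN(s)
                    + " isDigit=" + Character.isDigit(cp));
        }
    }
}
```

Only the Thai digit should report true for both Nd and isDigit; the other two match \p{N} alone, which is why \p{Nd} is the closer fit when consistency with isDigit is the goal.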

mgibsonbr
  • Simply `\p{N}` works: `System.out.println("3๓३".matches("\\p{N}+")) // true` – Bohemian Apr 25 '16 at 18:33
  • @Bohemian But `\p{N}` also matches `Nl` and `No`, which `isDigit` does not match. [Example](http://ideone.com/1GHJ1P). Sometimes you *want* to match those, but since the OP asked for a behavior consistent with `isDigit`, I answered with just `Nd`. – mgibsonbr Apr 26 '16 at 00:12

Use \d, but with the (?U) flag to enable the Unicode version of predefined character classes and POSIX character classes:

(?U)\d+

or in code:

System.out.println("3๓३".matches("(?U)\\d+")); // true

Using (?U) is equivalent to compiling the regex by calling Pattern.compile() with the UNICODE_CHARACTER_CLASS flag:

Pattern pattern = Pattern.compile("\\d", Pattern.UNICODE_CHARACTER_CLASS);
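
Applied to the pattern from the question, this could look like the sketch below. The class name UnicodeAlnum, the constant names, and the test strings are hypothetical and chosen only for illustration. It assumes Java 7 or later, where both \p{IsAlphabetic} and UNICODE_CHARACTER_CLASS are available:

```java
import java.util.regex.Pattern;

public class UnicodeAlnum {
    // Letters from any script plus decimal digits from any script.
    // With the flag set, \d is no longer limited to [0-9].
    static final Pattern ALNUM =
            Pattern.compile("[\\p{IsAlphabetic}\\d]+", Pattern.UNICODE_CHARACTER_CLASS);

    // Same expression without the flag: \d stays ASCII-only.
    static final Pattern ALNUM_ASCII_DIGITS =
            Pattern.compile("[\\p{IsAlphabetic}\\d]+");

    static boolean isAlnum(String s) {
        return ALNUM.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Latin letters, a CJK ideograph (U+4E2D), an ASCII digit,
        // a Thai digit (U+0E53) and a Devanagari digit (U+0969).
        String mixed = "abc\u4E2D3\u0E53\u0969";

        System.out.println(isAlnum(mixed));                             // true
        System.out.println(ALNUM_ASCII_DIGITS.matcher(mixed).matches()); // false
        System.out.println(isAlnum("a-b"));                             // false
    }
}
```

Keep in mind that the flag also switches \w, \s and the POSIX classes to their Unicode definitions, so it is worth checking the rest of the expression if it relies on the ASCII-only behavior.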

Wiktor Stribiżew
Bohemian
  • One of the rare occasions where I don't think a duplicate post deserves a downvote. New answer and all that. Don't you have merge powers, or does that not apply here? – Savior Apr 25 '16 at 19:23
  • @Pillar so merged. IMHO this answer is easier to remember and understand - who can remember all those funky posix classes? – Bohemian Apr 25 '16 at 19:29
  • What POSIX classes do you mean? POSIX character classes are `[:punct:]`, `[:digit:]`, etc. `\p{N}`, or `\p{L}` etc. are Unicode category classes (term used in .NET), or Unicode character properties (term used in PHP), and these are very handy, especially `\p{Ll}` and `\p{Lu}`. In Java, certainly `(?U)\d` looks preferable. – Wiktor Stribiżew Apr 28 '16 at 20:54