7

...when used in patterns like "\\p{someCharacterClass}".
I've used/seen some:

  • Lower
  • Upper
  • InCombiningDiacriticalMarks
  • ASCII

What is the definitive list of all supported built-in character classes? Where is it documented? What are the exact meanings?

Edited...

There seem to be a lot of "RTFM" answers referring to the javadoc for Pattern. That's the first place I looked before asking this question. Just so everyone is clear, the javadoc for Pattern makes no mention of any of the classes listed above.

The "correct" answer will mention "InCombiningDiacriticalMarks" somewhere on the page, and will not be some vague reference to "Unicode Standards".

Bohemian
  • 4
    Have you checked the [`Pattern` documentation](http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html)? – Greg Hewgill Dec 27 '11 at 23:22
  • 5
    @GregHewgill Yes, I did check it... did you? That's where I looked first, and there's no mention of the above there, nor any links to pages that mention them either – Bohemian Dec 27 '11 at 23:46
  • See the sections titled "POSIX character classes", "java.lang.Character classes", and "Unicode Support": *The supported categories are those of The Unicode Standard in the version specified by the Character class. The category names are those defined in the Standard, both normative and informative. The block names supported by Pattern are the valid block names accepted and defined by UnicodeBlock.forName.* – Greg Hewgill Dec 27 '11 at 23:48
  • @GregHewgill OK, cool. So exactly what is the link to the page that lists all of the above and their meanings? – Bohemian Dec 27 '11 at 23:59
  • I went to [`UnicodeBlock.forName`](http://docs.oracle.com/javase/6/docs/api/java/lang/Character.UnicodeBlock.html#forName(java.lang.String)) which led to http://unicode.org, where I found [Where can I find the definitive list of Unicode blocks?](http://unicode.org/faq/blocks_ranges.html#5) and finally [`Blocks.txt`](http://www.unicode.org/Public/UNIDATA/Blocks.txt) itself. – Greg Hewgill Dec 28 '11 at 00:04
  • @GregHewgill That link is a good start, but it doesn't define what each class means. Most are obvious by their name, but for example what does the `Tags` class match? – Bohemian Dec 28 '11 at 03:17
  • The `Blocks.txt` file notes the code point range, so then get the code chart for that range: http://www.unicode.org/charts/PDF/UE0000.pdf (I don't know what those "Tags" are used for either.) – Greg Hewgill Dec 28 '11 at 03:31
  • @GregHewgill OK, good answer. If you post an answer with this in it, I'll accept it! Thanks for your tenacity. – Bohemian Dec 28 '11 at 07:39
  • @GregHewgill btw, those "tags" are ASCII characters with literally a little luggage tag under each one - [look at them here](http://www.unicode.org/charts/PDF/Unicode-3.1/U31-E0000.pdf) – Bohemian Dec 28 '11 at 09:00

5 Answers

11

The documentation for [`Pattern`](http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html) says in the "Unicode Support" section:

> The supported categories are those of The Unicode Standard in the version specified by the Character class. The category names are those defined in the Standard, both normative and informative. The block names supported by Pattern are the valid block names accepted and defined by UnicodeBlock.forName.

The documentation for [`UnicodeBlock.forName`](http://docs.oracle.com/javase/6/docs/api/java/lang/Character.UnicodeBlock.html#forName(java.lang.String)) states:

> Block names are determined by The Unicode Standard.

On http://unicode.org there is the FAQ [Where can I find the definitive list of Unicode blocks?](http://unicode.org/faq/blocks_ranges.html#5):

> A: The Unicode blocks and their names are a normative part of the Unicode Standard. The exact list is always maintained in one of the files of the Unicode Character Database, Blocks.txt.

Finally, in [`Blocks.txt`](http://www.unicode.org/Public/UNIDATA/Blocks.txt) there is the line:

0300..036F; Combining Diacritical Marks

These characters can be found in the Combining Diacritical Marks code chart (from Unicode 6.0 Character Code Charts).
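
To tie the chain above back to an actual regex, here is a minimal sketch of how a block name from `Blocks.txt` ends up in a pattern: the name with spaces removed, prefixed with `In`. The class name `BlockNameDemo` and the helpers `stripAccents` and `block` are made up for illustration; the only library calls used are `Pattern.compile`, `Normalizer.normalize`, `UnicodeBlock.forName` and `UnicodeBlock.of`.

```java
import java.lang.Character.UnicodeBlock;
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is not part of any API.
class BlockNameDemo {
    // "In" + block name with spaces removed selects the block U+0300..U+036F.
    static final Pattern MARKS = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Decompose accented letters (NFD), then delete the combining marks.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return MARKS.matcher(decomposed).replaceAll("");
    }

    // Resolve a block name through the method the Pattern javadoc points to.
    static UnicodeBlock block(String name) {
        return UnicodeBlock.forName(name);
    }

    public static void main(String[] args) {
        // "caf\u00E9" is "cafe" with an acute accent on the last letter.
        System.out.println(stripAccents("caf\u00E9")); // cafe
        // The name is accepted with or without spaces.
        System.out.println(block("Combining Diacritical Marks") == UnicodeBlock.of(0x0300)); // true
        System.out.println(block("CombiningDiacriticalMarks") == UnicodeBlock.of(0x036F));   // true
    }
}
```

`UnicodeBlock.forName` throws `IllegalArgumentException` for a name it does not know, so it can also serve as a quick check of whether a block name is supported by the running JVM.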

Greg Hewgill
1

The Pattern API documentation says it conforms to Level 1 of Unicode Technical Standard #18 (Unicode Regular Expressions), defined at http://www.unicode.org/reports/tr18/

That report has three nice tables (search it for UCD.html); UCD.html itself is also worth a look.

Joop Eggen
0

This page has some good details for a few popular classes like:

  • \p{L} or \p{Letter}: any kind of letter from any language.
  • \p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
  • \p{N} or \p{Number}: any kind of numeric character in any script.
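
A quick way to check what these categories match in Java is to test single characters against them. This is only a sketch: `CategoryDemo` and `inCategory` are hypothetical names, and it uses just the one-letter forms (`\p{L}`, `\p{M}`, `\p{N}`), since I'm not certain the long names such as `\p{Letter}` are accepted by `java.util.regex`.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name is not part of any API.
class CategoryDemo {
    // True when the whole input is one character in the given
    // one-letter general category, e.g. "L", "M" or "N".
    static boolean inCategory(String category, String input) {
        return Pattern.matches("\\p{" + category + "}", input);
    }

    public static void main(String[] args) {
        System.out.println(inCategory("L", "\u00DF")); // true: LATIN SMALL LETTER SHARP S
        System.out.println(inCategory("M", "\u0301")); // true: COMBINING ACUTE ACCENT
        System.out.println(inCategory("N", "\u0663")); // true: ARABIC-INDIC DIGIT THREE
        System.out.println(inCategory("N", "x"));      // false: a letter, not a number
    }
}
```
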
Justin Harris
0

The spec is http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html. In some cases it refers to The Unicode Standard (TUS) for a particular version; you can find that material at http://unicode.org.

bmargulies
  • Yeah I read Pattern javadoc before posting this question. What is the exact link to the page that lists all of the classes and their meanings? – Bohemian Dec 28 '11 at 00:01
-1

Look in the javadocs for the Pattern class.

duffymo
  • Yeah I read Pattern javadoc before posting this question. What is the exact link to the page that lists all of the classes and their meanings? – Bohemian Dec 28 '11 at 00:01
  • It's in the javadoc link that I posted. That's the precise page, unless I fail to understand your question. – duffymo Dec 28 '11 at 00:37
  • Where is "CombiningDiacriticalMarks" on that page? (That is a rhetorical question. It's not there). I want the link to the full list and definition of each supported character class. – Bohemian Dec 28 '11 at 03:20
  • Moderators, please note: It's another case of an answer of mine being singled out for a downvote years after the fact. None of the others on the page are downvoted. I can't help but think it's targeted. – duffymo Mar 18 '15 at 20:50