How to identify programmatically in Java which Unicode version is supported?

Question

Since Java code can run on any Java VM, I'd like to know how to identify programmatically which Unicode version is supported.

I suggest clarifying the question to make reference to http://www.unicode.org/versions/ and that you really are talking about the unicode version, not character set. — Arafangion, Aug 04 '11 at 12:29

tchrist · Answer 1 · 2011-08-06T11:29:34.273

The easiest but worst way I can think of to do that would be to pick a code point that’s new to each Unicode release, and check its Character properties. Or you could check its General Category with a regex. Here are some selected code points:

Unicode 6.0.0:

Ꞡ  U+A7A0 GC=Lu SC=Latin    LATIN CAPITAL LETTER G WITH OBLIQUE STROKE
₹  U+20B9 GC=Sc SC=Common   INDIAN RUPEE SIGN
ₜ  U+209C GC=Lm SC=Latin    LATIN SUBSCRIPT SMALL LETTER T

Unicode 5.2:

Ɒ  U+2C70 GC=Lu SC=Latin    LATIN CAPITAL LETTER TURNED ALPHA
⅐  U+2150 GC=No SC=Common   VULGAR FRACTION ONE SEVENTH
⸱  U+2E31 GC=Po SC=Common   WORD SEPARATOR MIDDLE DOT

Unicode 5.1:

ꝺ  U+A77A GC=Ll SC=Latin    LATIN SMALL LETTER INSULAR D
Ᵹ  U+A77D GC=Lu SC=Latin    LATIN CAPITAL LETTER INSULAR G
⚼  U+26BC GC=So SC=Common   SESQUIQUADRATE

Unicode 5.0:

Ⱶ  U+2C75 GC=Lu SC=Latin    LATIN CAPITAL LETTER HALF H
ɂ  U+0242 GC=Ll SC=Latin    LATIN SMALL LETTER GLOTTAL STOP
⬔  U+2B14 GC=So SC=Common  SQUARE WITH UPPER RIGHT DIAGONAL HALF BLACK

I've included the general category and the script property, although you can only inspect the script in JDK7, the first Java release that supports that.
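As a rough illustration of inspecting both properties from Java, here is a minimal sketch. It assumes Java 7 or later for Character.UnicodeScript; the class and method names (ScriptProbe, describe) are made up for this example.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper: reports the general category and script of a code point.
class ScriptProbe {

    // Character.getType returns the general category as an int constant
    // (for example Character.UPPERCASE_LETTER); UnicodeScript.of needs Java 7+.
    static String describe(int codePoint) {
        int category = Character.getType(codePoint);
        UnicodeScript script = UnicodeScript.of(codePoint);
        return String.format("U+%04X GC=%d SC=%s", codePoint, category, script);
    }

    public static void main(String[] args) {
        System.out.println(describe(0xA7A0)); // LATIN CAPITAL LETTER G WITH OBLIQUE STROKE
        System.out.println(describe(0x20B9)); // INDIAN RUPEE SIGN
    }
}
```

On a runtime whose Unicode data predates a code point, the category comes back as Character.UNASSIGNED and the script as UNKNOWN, which is what makes this usable as a version probe.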

I found those code points by running commands like this from the command line:

% unichars -gs '\p{Age=5.1}'
% unichars -gs '\p{Lu}' '\p{Age=5.0}'

That’s the unichars program. It will only find properties supported in the Unicode Character Database for whichever UCD version the Perl you’re running supports.

I also like my output sorted, so I tend to run

 % unichars -gs '\p{Alphabetic}' '\p{Age=6.0}' | ucsort | less -r

where that’s the ucsort program, which sorts text according to the Unicode Collation Algorithm.

However, in Perl unlike in Java this is easy to find out. For example, if you run this from the command line (yes, there’s a programmer API, too), you find:

$ corelist -a Unicode
    v5.6.2     3.0.1     
    v5.8.0     3.2.0     
    v5.8.1     4.0.0 
    v5.8.8     4.1.0
    v5.10.0    5.0.0     
    v5.10.1    5.1.0 
    v5.12.0    5.2.0 
    v5.14.0    6.0.0

That shows that Perl version 5.14.0 was the first one to support Unicode 6.0.0. For Java, I believe there is no API that gives you this information directly, so you’ll have to hardcode a table mapping Java versions and Unicode versions, or else use the empirical method of testing code points for properties. By empirically, I mean the equivalent of this sort of thing:

% ruby -le 'print "\u2C75" =~ /\p{Lu}/ ? "pass 5.2" : "fail 5.2"'
pass 5.2
% ruby -le 'print "\uA7A0" =~ /\p{Lu}/ ? "pass 6.0" : "fail 6.0"'
fail 6.0
% ruby -v
ruby 1.9.2p0 (2010-08-18 revision 29036) [i386-darwin9.8.0]

% perl -le 'print "\x{2C75}" =~ /\p{Lu}/ ? "pass 5.2" : "fail 5.2"'
pass 5.2
% perl -le 'print "\x{A7A0}" =~ /\p{Lu}/ ? "pass 6.0" : "fail 6.0"'
pass 6.0
% perl -v
This is perl 5, version 14, subversion 0 (v5.14.0) built for darwin-2level
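A Java equivalent of those one-liners might look like the sketch below. The class and method names are invented for illustration, and the probe code points are taken from the tables earlier in this answer (U+2C70 for 5.2, U+A7A0 for 6.0).

```java
// Hypothetical probe: does this runtime classify a code point as an uppercase letter?
class UnicodeProbe {

    // Builds a one-code-point string and tests it against \p{Lu}.
    // A code point unknown to the runtime's Unicode data will not match.
    static boolean isUppercaseLetter(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.matches("\\p{Lu}");
    }

    public static void main(String[] args) {
        System.out.println(isUppercaseLetter(0x2C70) ? "pass 5.2" : "fail 5.2");
        System.out.println(isUppercaseLetter(0xA7A0) ? "pass 6.0" : "fail 6.0");
        System.out.println(System.getProperty("java.version"));
    }
}
```

The same check can be made without a regex by comparing Character.getType(codePoint) against Character.UPPERCASE_LETTER.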

To find out the age of a particular code point, run uniprops -a on it like this:

% uniprops -a 10424
U+10424 ‹𐐤› \N{DESERET CAPITAL LETTER EN}
 \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
 All Any Alnum Alpha Alphabetic Assigned InDeseret Cased Cased_Letter LC Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM Changes_When_Lowercased CWL Changes_When_NFKC_Casefolded CWKCF Deseret Dsrt Lu L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Uppercase_Letter Print Upper Uppercase Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Upper X_POSIX_Word
 Age=3.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Deseret Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None Script=Deseret East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Dsrt Script=Dsrt Sentence_Break=UP Sentence_Break=Upper SB=UP Word_Break=ALetter WB=LE Word_Break=LE _X_Begin

All my Unicode tools are available in the Unicode::Tussle bundle, including unichars, uninames, uniquote, ucsort, and many more.

Java 1.7 Improvements

JDK7 goes a long way to making a few Unicode things easier. I talk about that a bit at the end of my OSCON Unicode Support Shootout talk. I had thought of putting together a table of which languages support which versions of Unicode in which versions of those languages, but ended up scrapping that to tell people to just get the latest version of each language. For example, I know that Unicode 6.0.0 is supported by Java 1.7, Perl 5.14, and Python 2.7 or 3.2.

JDK7 contains updates for classes Character, String, and Pattern in support of Unicode 6.0.0. This includes support for Unicode script properties, and several enhancements to Pattern to allow it to meet Level 1 support requirements for Unicode UTS#18 Regular Expressions. These include

The Character.isUpperCase and Character.isLowerCase methods now correctly correspond to the Unicode Uppercase and Lowercase properties; previously they applied only to letters, which isn’t right, because it misses Other_Uppercase and Other_Lowercase code points, respectively. For example, these are some lowercase code points which are not GC=Ll (lowercase letters), selected samples only:

% unichars -gs '\p{lowercase}' '\P{LL}'
◌ͅ  U+0345 GC=Mn SC=Inherited    COMBINING GREEK YPOGEGRAMMENI
ͺ  U+037A GC=Lm SC=Greek        GREEK YPOGEGRAMMENI
ˢ  U+02E2 GC=Lm SC=Latin        MODIFIER LETTER SMALL S
ˣ  U+02E3 GC=Lm SC=Latin        MODIFIER LETTER SMALL X
ᴬ  U+1D2C GC=Lm SC=Latin        MODIFIER LETTER CAPITAL A
ᴮ  U+1D2E GC=Lm SC=Latin        MODIFIER LETTER CAPITAL B
ᵂ  U+1D42 GC=Lm SC=Latin        MODIFIER LETTER CAPITAL W
ᵃ  U+1D43 GC=Lm SC=Latin        MODIFIER LETTER SMALL A
ᵇ  U+1D47 GC=Lm SC=Latin        MODIFIER LETTER SMALL B
ₐ  U+2090 GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER A
ₑ  U+2091 GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER E
ⅰ  U+2170 GC=Nl SC=Latin        SMALL ROMAN NUMERAL ONE
ⅱ  U+2171 GC=Nl SC=Latin        SMALL ROMAN NUMERAL TWO
ⅲ  U+2172 GC=Nl SC=Latin        SMALL ROMAN NUMERAL THREE
ⓐ  U+24D0 GC=So SC=Common       CIRCLED LATIN SMALL LETTER A
ⓑ  U+24D1 GC=So SC=Common       CIRCLED LATIN SMALL LETTER B
ⓒ  U+24D2 GC=So SC=Common       CIRCLED LATIN SMALL LETTER C

The alphabetic tests are now correct in that they use Other_Alphabetic. They did this wrong prior to 1.7, which is a problem.
The \x{HHHHH} pattern escape, so you can meet RL1.1; this lets you rewrite [𝒜-𝒵] (which fails due to The UTF‐16 Curse) as [\x{1D49C}-\x{1D4B5}]. JDK7 is the first Java release that fully/correctly supports non-BMP characters in this regard. Amazing but true.
More properties for RL1.2, of which the script property is by far the most important. This lets you write \p{script=Greek} for example, abbreviated as \p{IsGreek}.
The new UNICODE_CHARACTER_CLASS pattern compilation flag and corresponding pattern‐embeddable flag "(?U)" to meet RL1.2a on compatibility properties.
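The items above can be tried out with a short sketch like the following. It assumes Java 7 or later; the class and method names are invented for illustration only.

```java
import java.util.regex.Pattern;

// Hypothetical demo of the Java 7 regex and Character changes listed above.
class Java7RegexDemo {

    // \x{HHHHH} lets a character class span non-BMP code points directly.
    static boolean inMathScriptCapitals(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.matches("[\\x{1D49C}-\\x{1D4B5}]", s);
    }

    // Script property, written in the long form.
    static boolean isAllGreek(String s) {
        return Pattern.matches("\\p{script=Greek}+", s);
    }

    // (?U) turns on UNICODE_CHARACTER_CLASS, so \w covers non-ASCII letters.
    static boolean isUnicodeWord(String s) {
        return Pattern.matches("(?U)\\w+", s);
    }

    public static void main(String[] args) {
        System.out.println(inMathScriptCapitals(0x1D49C));
        System.out.println(isAllGreek("\u03B1\u03B2\u03B3"));
        System.out.println(isUnicodeWord("\u00E9t\u00E9"));
        // U+02E2 MODIFIER LETTER SMALL S is GC=Lm but has Other_Lowercase.
        System.out.println(Character.isLowerCase(0x02E2));
    }
}
```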

I can certainly see why you want to make sure you’re running a Java with Unicode 6.0.0 support, since that comes with all those other benefits, too.

score 6 · Accepted Answer · edited May 23 '17 at 10:24

This is not trivial if you are looking for a class to make this information available to you.

Typically, versions of Unicode supported by Java change from one major specification to another, and this information is documented in the Character class of the Java API documentation (which is derived from the Java Language Specification). You cannot, however, rely on the Java Language Specification, as each major version of Java need not have its own version of the Java Language Specification.

Therefore, you ought to translate between the version of Java supported by the JVM and the supported Unicode version, as:

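// Maps the JVM's specification version to the Unicode version it supports.
static String unicodeVersion() {
    String specVersion = System.getProperty("java.specification.version");
    // Constant-first equals() avoids a NullPointerException if the property is unset.
    if ("1.7".equals(specVersion))
        return "6.0";
    else if ("1.6".equals(specVersion))
        return "4.0";
    else if ("1.5".equals(specVersion))
        return "4.0";
    else if ("1.4".equals(specVersion))
        return "3.0";
    // ... and so on for other versions
    return null; // unknown specification version
}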

The details of the supported versions can be obtained from the Java Language Specification. Quoting JSR 901, which is the language specification for Java 7:

The Java SE platform tracks the Unicode specification as it evolves. The precise version of Unicode used by a given release is specified in the documentation of the class Character.

Versions of the Java programming language prior to 1.1 used Unicode version 1.1.5. Upgrades to newer versions of the Unicode Standard occurred in JDK 1.1 (to Unicode 2.0), JDK 1.1.7 (to Unicode 2.1), Java SE 1.4 (to Unicode 3.0), and Java SE 5.0 (to Unicode 4.0).

@tchrist: I guess you'd have to use a BigDecimal though, unless you wanted Java 1.0 to support Unicode 1.100000000000000088817841970012523233890533447265625 :-) Overkill? — bobince, Aug 04 '11 at 20:01
Actually the reason why I omitted Doubles and Floats and numerical representations is because version numbers need not be decimals. Take the first version of Unicode supported - 1.1.5 will simply not parse to a numerical format. — Vineet Reynolds, Aug 04 '11 at 20:14
@Vineet I guess that’s why programming APIs have version objects. — tchrist, Aug 06 '11 at 11:36
@tchrist, yes, and that is missing in this case (and many more) if you need the list of versions mentioned in the `System.getProperty` call. On your other comment, the undue delay between support for Unicode v4.0 and v6.0, can be attributed to the delay in the release of Java 7; since Sun/Oracle changes the TCK only for major releases, support for Unicode 5.x couldn't be brought in an update. This is unlike other languages or even products. I found it quite amusing when the Oracle DB 11g (or maybe 10g R2) supported Unicode 5.2, but the JVM within couldn't. — Vineet Reynolds, Aug 06 '11 at 12:07

score 4 · Answer 3 · edited May 03 '20 at 23:20

The Unicode version is defined in the Java Language Specification §3.1. Unicode 4.0 has been supported since J2SE 5.0.

To quote:

Versions of the Java programming language prior to JDK 1.1 used Unicode 1.1.5. Upgrades to newer versions of the Unicode Standard occurred in JDK 1.1 (to Unicode 2.0), JDK 1.1.7 (to Unicode 2.1), Java SE 1.4 (to Unicode 3.0), Java SE 5.0 (to Unicode 4.0), Java SE 7 (to Unicode 6.0), Java SE 8 (to Unicode 6.2), Java SE 9 (to Unicode 8.0), Java SE 11 (to Unicode 10.0), Java SE 12 (to Unicode 11.0), and Java SE 13 (to Unicode 12.1).

edited May 03 '20 at 23:20

Basil Bourque

answered Aug 04 '11 at 12:35

pmnt

[Java 7 supports Unicode 6.0](http://download.oracle.com/javase/7/docs/api/java/lang/Character.html). Unicode 4.0 support is restricted to versions 5.0 and 6.0 of the Java platform. – Vineet Reynolds Aug 04 '11 at 12:55
@Vineet: Supporting merely Unicode 4 from way way back in 2003 is a major liability. I’m surprised that Java 1.6 did not support Unicode 5.0; it could have done so if you look at the release dates. It means people had to wait a very very very long time for current Unicode spec support. Almost makes you think somebody forgot about it. – tchrist Aug 06 '11 at 11:36

Michał Šrajer · Answer 4 · score 4 · 2013-07-12T09:25:54.393

I don't think it's available via a public API. But this is not subject to change very often, so you can get the specification version:

System.getProperties().getProperty("java.specification.version")

and, based on that, work out the Unicode version.

java 1.0 -> Unicode 1.1
java 1.1 -> Unicode 2.0
java 1.2 -> Unicode 2.0
java 1.3 -> Unicode 2.0
java 1.4 -> Unicode 3.0
java 1.5 -> Unicode 4.0
java 1.6 -> Unicode 4.0
java 1.7 -> Unicode 6.0

To verify it, you can see the JavaDoc for Character class.
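One way to turn that table into code is a plain lookup map, sketched below. The class and method names are invented; the rows for Java 8 and later come from the Java Language Specification quote in another answer on this page, and the Java 10 row is an assumption based on that quote listing no upgrade between Java 9 and Java 11.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical lookup from java.specification.version to Unicode version.
class UnicodeVersionTable {

    private static final Map<String, String> TABLE = new LinkedHashMap<String, String>();
    static {
        TABLE.put("1.1", "2.0");
        TABLE.put("1.2", "2.0");
        TABLE.put("1.3", "2.0");
        TABLE.put("1.4", "3.0");
        TABLE.put("1.5", "4.0");
        TABLE.put("1.6", "4.0");
        TABLE.put("1.7", "6.0");
        TABLE.put("1.8", "6.2");
        TABLE.put("9", "8.0");
        TABLE.put("10", "8.0");   // assumption: unchanged from Java 9
        TABLE.put("11", "10.0");
        TABLE.put("12", "11.0");
        TABLE.put("13", "12.1");
    }

    // Returns null when the specification version is not in the table.
    static String lookup(String specVersion) {
        return TABLE.get(specVersion);
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        System.out.println("Java " + spec + " -> Unicode " + lookup(spec));
    }
}
```

The obvious drawback is that the table has to be extended by hand for every new Java release, so an unknown version should be treated as "not known" rather than as an error.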

edited Jul 12 '13 at 09:25

answered Aug 04 '11 at 12:45

Michał Šrajer

Last line: Typo 1.6 --> 1.7 ? – Jul 08 '13 at 12:38

Chris W. Johnson · Answer 5 · 2022-05-01T18:33:44.053

Here's a method I use, which should be compatible with all versions of Java >= 1.1. It's future-proofed only up to Unicode 15.0 (scheduled for release in September 2022), but is easily extended by referring to the Unicode "DerivedAge.txt" file (see the URL in the code comments).

As far back as I've tested, it agrees with the table Michał Šrajer compiled, and it correctly determines Java 8 supports Unicode 6.2, Java 9 supports Unicode 8.0, Java 13 supports Unicode 12.1, and Java 16 supports Unicode 13.0.

/**
 * Gets the <a href="https://www.unicode.org/versions/enumeratedversions.html">Unicode
 * version</a> supported by the current Java runtime. The version is as an {@code int}
 * storing the major and minor version numbers in low-order octets 1 and 0, respectively.
 * It can be converted to dotted-decimal by code such as {@code (version >> 8) + "." +
 * (version & 0xFF)}, and {@code System.out.printf("Unicode version %d.%d%n", version >>
 * 8, version & 0xFF)}.
 * <p>
 * As of 2022-05-01, the most recent Unicode derived age data stops at version 15.0.0d2.
 * Therefore, if this method returns {@code 0xF00}, the Unicode version is 15.0 <i>or
 * greater</i>. Prior versions are identified unambiguously.
 * <p>
 * This method is compatible with Java versions >= 1.1.
 *
 * @return Unicode version number {@code int}, storing the major and minor versions in,
 *         respectively, low-order octets 1 and 0. Thus, version 19.2.5 is {@code 0x1302}
 *         (the "update" number, 5, is omitted, because updates cannot add code-points).
 */
public static int getUnicodeVersion() {

/*  Version identification is a descending search for "Character.getType"
    recognition of a new code-point unique to each version. (See
    <https://www.unicode.org/Public/UCD/latest/ucd/DerivedAge.txt>.)
    
    Major and minor versions ("A.B" in version "A.B.C") are identified,
    but not "update" numbers ("C" in prior example), consistent with
    "Unicode Standard Annex #44, Unicode Character Database", revision
    28 (Unicode 14.0.0), section 5.14, which states:
    
    "Formally, the Age property is a catalog property whose enumerated
    values correspond to a list of tuples consisting of a major version
    integer and a minor version integer. The major version is a positive
    integer constrained to the range 1..255. The minor version is a non-
    negative integer constrained to the range 0..255. These range limit-
    ations are specified so that implementations can be guaranteed that
    all valid, assigned Age values can be represented in a sequence of
    two unsigned bytes. A third value corresponding to the Unicode update
    version is not required, because new characters are never assigned in
    update versions of the standard."
    
    Source: <https://www.unicode.org/reports/tr44/#Character_Age>.
*/

//  Preliminary Unicode 15.0 data from
//  <https://www.unicode.org/Public/15.0.0/ucd/DerivedAge-15.0.0d2.txt>.
    
    if (Character.getType('\u0CF3') != Character.UNASSIGNED)
        return 0xF00;    // 15.0, release scheduled for September 2022.
    
    if (Character.getType('\u061D') != Character.UNASSIGNED)
        return 0xE00;    // 14.0, September 2021.
    
    if (Character.getType('\u08Be') != Character.UNASSIGNED)
        return 0xD00;    // 13.0, March 2020.
    
    if (Character.getType('\u32FF') != Character.UNASSIGNED)
        return 0xC01;    // 12.1, May 2019.
    
    if (Character.getType('\u0C77') != Character.UNASSIGNED)
        return 0xC00;    // 12.0, March 2019.
    
    if (Character.getType('\u0560') != Character.UNASSIGNED)
        return 0xB00;    // 11.0, June 2018.
    
    if (Character.getType('\u0860') != Character.UNASSIGNED)
        return 0xA00;    // 10.0, June 2017.
    
    if (Character.getType('\u08b6') != Character.UNASSIGNED)
        return 0x900;     // 9.0, June 2016.
    
    if (Character.getType('\u08b3') != Character.UNASSIGNED)
        return 0x800;     // 8.0, June 2015.
    
    if (Character.getType('\u037f') != Character.UNASSIGNED)
        return 0x700;     // 7.0, June 2014.
    
    if (Character.getType('\u061c') != Character.UNASSIGNED)
        return 0x603;     // 6.3, September 2013.
    
    if (Character.getType('\u20ba') != Character.UNASSIGNED)
        return 0x602;     // 6.2, September 2012.
    
    if (Character.getType('\u058f') != Character.UNASSIGNED)
        return 0x601;     // 6.1, January 2012.
    
    if (Character.getType('\u0526') != Character.UNASSIGNED)
        return 0x600;     // 6.0, October 2010.
    
    if (Character.getType('\u0524') != Character.UNASSIGNED)
        return 0x502;     // 5.2, October 2009.
    
    if (Character.getType('\u0370') != Character.UNASSIGNED)
        return 0x501;     // 5.1, March 2008.
    
    if (Character.getType('\u0242') != Character.UNASSIGNED)
        return 0x500;     // 5.0, July 2006.
    
    if (Character.getType('\u0237') != Character.UNASSIGNED)
        return 0x401;     // 4.1, March 2005.
    
    if (Character.getType('\u0221') != Character.UNASSIGNED)
        return 0x400;     // 4.0, April 2003.
    
    if (Character.getType('\u0220') != Character.UNASSIGNED)
        return 0x302;     // 3.2, March 2002.
    
    if (Character.getType('\u03f4') != Character.UNASSIGNED)
        return 0x301;     // 3.1, March 2001.
    
    if (Character.getType('\u01f6') != Character.UNASSIGNED)
        return 0x300;     // 3.0, September 1999.
    
    if (Character.getType('\u20ac') != Character.UNASSIGNED)
        return 0x201;     // 2.1, May 1998.
    
    if (Character.getType('\u0591') != Character.UNASSIGNED)
        return 0x200;     // 2.0, July 1996.
    
    if (Character.getType('\u0000') != Character.UNASSIGNED)
        return 0x101;     // 1.1, June 1993.
    
    return 0x100;         // 1.0
}

The code for detecting Unicode versions prior to 2.0 will never be reached (given the Java 1.1 or greater requirement), and is present merely for the sake of completeness.
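For working with the packed int that the method returns, a small helper along these lines may be convenient. This is only a sketch: the class and method names are invented, and the conversion follows the formula given in the Javadoc above.

```java
// Hypothetical helpers for the packed major/minor version int described above.
class UnicodeVersionFormat {

    // Major version is in bits 8..15, minor version in bits 0..7.
    static String toDotted(int packed) {
        return (packed >> 8) + "." + (packed & 0xFF);
    }

    // Because major occupies the higher octet, plain int comparison orders versions.
    static boolean atLeast(int packed, int major, int minor) {
        return packed >= ((major << 8) | minor);
    }

    public static void main(String[] args) {
        int version = 0xC01; // stand-in for a value returned by getUnicodeVersion()
        System.out.println("Unicode version " + toDotted(version));
        System.out.println("At least 6.0? " + atLeast(version, 6, 0));
    }
}
```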

score 1 · Answer 6 · answered Aug 04 '11 at 12:42

Since the supported Unicode version is defined by the Java version, you might use that information and infer the Unicode version based on what System.getProperty("java.version") returns.

I assume you want to support only specific Unicode versions, or at least some minimum. I'm no Unicode expert, but since the versions seem to be backward compatible, you might require the Unicode version to be at least 4.0, which means the supported Java version would be at least 5.0.
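A minimum-version check of that kind could be sketched as follows. It reads java.specification.version rather than java.version because the former has a simpler format; the class and method names are invented, and the parsing assumes the property looks like "1.5" for older releases or "11" for newer ones.

```java
// Hypothetical check that the runtime is at least Java 5 (and so Unicode 4.0 or later).
class MinimumJava {

    // Turns "1.5" into 5 and "11" into 11.
    static int majorVersion(String specVersion) {
        String s = specVersion.startsWith("1.") ? specVersion.substring(2) : specVersion;
        int dot = s.indexOf('.');
        if (dot >= 0) {
            s = s.substring(0, dot);
        }
        return Integer.parseInt(s);
    }

    static boolean supportsAtLeastUnicode4(String specVersion) {
        return majorVersion(specVersion) >= 5;
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        System.out.println(supportsAtLeastUnicode4(spec));
    }
}
```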
