
I would like to analyze whether each character in a char array is full width or half width.

for example:

char[] password = {'t','e','s','t','思','題'};

There are full width and half width characters in this char array.

half width = t,e,s,t

full width = 思,題

So, how can I determine whether each character of a char array is full width or half width in Java?

Thanks a lot!

Chan Myae Thu
  • What definition are you using for "full width" and "half width"? – Jeff Nov 22 '12 at 02:30
  • So Chinese characters are full width and English characters are half width? Is that right to say? What about alphabets of languages other than Chinese? Why don't you store your Chinese alphabet and compare inputs against that store? That's the simplest logical view, unless you have other intentions which require more complex logic. – bonCodigo Nov 22 '12 at 02:38
  • What do you mean by analyze? – kaneda Nov 22 '12 at 02:38
  • See also [`Character`](http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html). – trashgod Nov 22 '12 at 02:43
  • Back in the day, terminals displayed fixed-size Western characters only. Oriental countries ported them and used the same height, but double the width, to display Oriental characters; these are called full-width. Today we still use text-only consoles, so I guess the OP still needs to figure it out. – irreputable Nov 22 '12 at 02:46
  • What is the goal? Is it for figuring out how large it will display in a window? Or is it for some other sort of calculation? – TofuBeer Nov 22 '12 at 02:53

5 Answers


The width of an East Asian character is described in Unicode Standard Annex #11, which defines the East_Asian_Width property of a Unicode character.

Although I could find no way of querying this property using the standard Java 8 libraries, one can use the ICU4J library (com.ibm.icu:icu4j in Maven) to get this value.

For example, the following code returns UCharacter.EastAsianWidth.WIDE:

// UCharacter and UProperty come from ICU4J (com.ibm.icu.lang)
int esw = UCharacter.getIntPropertyValue('あ', UProperty.EAST_ASIAN_WIDTH);

Some testing with Japanese characters has shown that all single-byte Shift JIS kana characters (e.g. halfwidth ｱｲｳｴｵ) are designated HALFWIDTH, while their fullwidth counterparts (e.g. アイウエオ) are designated FULLWIDTH. All other fullwidth characters, such as あいうえお, return WIDE, and non-fullwidth characters such as plain Abc return NARROW.
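
To make this concrete, here is a minimal sketch of my own (the class name and output format are mine, and it assumes ICU4J is on the classpath) that classifies the array from the question:

import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.lang.UProperty;

public class EastAsianWidthDemo {
    public static void main(String[] args) {
        char[] password = {'t', 'e', 's', 't', '思', '題'};
        for (char c : password) {
            // Query the East_Asian_Width property (UAX #11) via ICU4J.
            int width = UCharacter.getIntPropertyValue(c, UProperty.EAST_ASIAN_WIDTH);
            boolean full = width == UCharacter.EastAsianWidth.WIDE
                        || width == UCharacter.EastAsianWidth.FULLWIDTH;
            System.out.println(c + " -> " + (full ? "full width" : "half width"));
        }
    }
}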

The value AMBIGUOUS needs some extra care because its width varies depending on context. For instance, the vim editor has an ambiwidth option to let the user choose whether such characters should be treated as narrow or wide, since rendering is terminal-dependent.

The aforementioned annex states for ambiguous characters: "Ambiguous characters occur in East Asian legacy character sets as wide characters, but as narrow (i.e., normal-width) characters in non-East Asian usage."

It also states for NEUTRAL: "Strictly speaking, it makes no sense to talk of narrow and wide for neutral characters, but because for all practical purposes they behave like Na, they are treated as narrow characters (the same as Na) under the recommendations below."

However, I have found that NEUTRAL characters do not always render narrow, as some of them appear wide in editors I have tried. Furthermore, some characters are designated AMBIGUOUS while the characters immediately following them are NEUTRAL, and this doesn't seem to make sense. Perhaps characters not mapped in ICU4J fall back to NEUTRAL.

Lastly, UCharacter.EastAsianWidth.COUNT is just a constant holding the number of values defined under UCharacter.EastAsianWidth, not a value that getIntPropertyValue() will return.

antak

The JDK contains one class that mentions full/half width: InputSubset

http://docs.oracle.com/javase/7/docs/api/java/awt/im/InputSubset.html

Unfortunately there's no method to check which char falls in which subset.

Nonetheless, full/half width is apparently a well-defined concept in Unicode; there may be an accurate spec somewhere on the internet.

http://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms

http://en.wikipedia.org/wiki/DBCS

I guess it'll be good enough for your use case to say that 0x00–0xFF chars are half-width and all other chars are full-width, except for the half-width chars in the Unicode block "Halfwidth and Fullwidth Forms":

boolean isHalfWidth(char c)
{
    // Basic Latin and Latin-1 Supplement,
    // plus the halfwidth katakana/hangul range (U+FF61-U+FFDC)
    // and the halfwidth symbol range (U+FFE8-U+FFEE)
    // of the "Halfwidth and Fullwidth Forms" block.
    return '\u0000' <= c && c <= '\u00FF'
        || '\uFF61' <= c && c <= '\uFFDC'
        || '\uFFE8' <= c && c <= '\uFFEE' ;
}
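
Applied to the array from the question (a small usage sketch of my own), this heuristic gives the expected split:

char[] password = {'t','e','s','t','思','題'};
for (char c : password) {
    // 't','e','s','t' fall in 0x00-0xFF; '思' and '題' fall in neither half-width range
    System.out.println(c + " -> " + (isHalfWidth(c) ? "half width" : "full width"));
}
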
irreputable

The visible width of a character really depends on the font that you view it in, and the characters in Java are abstract with respect to fonts.

If you're looking to determine whether a particular character is a CJK (or language subset, etc.) character, you might try finding the bit-pattern range that those characters take in UTF-16 (I think that's what Java uses?) and making sure that each char value falls within that range.

I may be completely barking up the wrong tree here though, so let me know if this is what you're after.

EDIT: actually, now I'm not sure that the Java encoding is entirely abstract, after looking at trashgod's link. The char comparisons may still be a good way to go, though, as there are definitions of full-width hex codes in the character documentation.

Jeff
  • [`UnicodeBlock`](http://docs.oracle.com/javase/7/docs/api/java/lang/Character.UnicodeBlock.html) is a more explicit way of checking for CJK characters. That said, you need to make sure to work with code points, not `char`s, since CJK characters are outside the BMP (see the sketch after this comment thread). – millimoose Nov 22 '12 at 02:53
  • @millimoose you should totally make that an answer. I'd upvote it. – Jeff Nov 22 '12 at 03:00
  • It's still completely unclear what the heck the OP is actually trying to do though. – millimoose Nov 22 '12 at 03:03
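
A minimal sketch of the `UnicodeBlock` approach millimoose describes (my own illustration; the particular set of blocks to test is an assumption and should be tuned to the use case):

// True if the code point falls in one of the common CJK-related blocks.
static boolean isCjk(int codePoint) {
    Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
    return block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
        || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
        || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
        || block == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
        || block == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS;
}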

You appear to be talking about the number of bits in the internal representation of a character, as opposed to the "visible width" referred to in another answer.

The Character class and the char primitive type in Java both use standard Unicode, which handles Latin, Chinese, and many other scripts. Some Unicode characters fit in 16 bits; some take more.

So I think the answer to your question is: go ahead and analyze however you want -- your array contains some 16-bit values and probably some values greater than 16 bits. Without knowing more about what you want to do with the characters, it is hard to be any more explicit.

EDIT: my mistake, the char primitive only handles 16-bit Unicode values. But an array of Character objects would handle Unicode values greater than 16 bits.

arcy
  • An array of `char`s or `Character`s is the wrong choice here. You need to use a `String`, and walk it using [`codePointAt()`](http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#codePointAt(int)). (This would be somewhat tricky because you need to advance the index based on the width of the last character; see the sketch after this comment thread.) – millimoose Nov 22 '12 at 03:06
  • Why would a String be a better choice, and why wouldn't you just use charAt() if it was? – arcy Nov 22 '12 at 13:11
  • `char` is a 16-bit data type that can only represent characters from the BMP; most of the available CJK characters are outside the BMP. Java's strings store these as UTF-16 surrogate pairs. The "native" way of working with characters that might be outside the BMP is using the full 32-bit code points, represented with `int`s. (As the static methods on `Character` do.) Using a `char[]` or a `Character[]` or `String.charAt()` would mean that you'd have to check if the current `char` is a surrogate, and compose it and the next `char` into a codepoint using `Character.toCodePoint()`. – millimoose Nov 22 '12 at 14:38
  • `String` is more convenient, because it already provides `codePointAt()` which automatically assembles the surrogate pairs for you. And for character-by-character processing, you can do everything with `int` codepoints that you can with `char`s, except maybe output them to a `Writer`. (A curious omission. There's `Writer.write(int)` but it ignores the high-order bits.) – millimoose Nov 22 '12 at 14:45
  • You're also wrong in claiming that a `Character[]` could handle non-BMP characters. A `Character` object is merely a wrapper around a `char` and can't represent anything more than a `char` can. The `Character` class does serve as a namespace for static utility methods that can handle 32-bit codepoints – but those methods take `int` parameters. They don't take `Character` parameters, nor are they instance methods of `Character` objects. – millimoose Nov 22 '12 at 14:47
  • thanks for the correction -- I had read that char/Character was based on the (original) unicode standard, and didn't read far enough that the standard has changed. – arcy Nov 22 '12 at 23:26
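
To make the code-point walk concrete, here is a minimal sketch of my own of the approach millimoose describes:

String password = "test思題";
for (int i = 0; i < password.length(); ) {
    int cp = password.codePointAt(i);  // assembles surrogate pairs automatically
    System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
    i += Character.charCount(cp);      // advance by one or two chars
}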

It really depends on how you define what a full-width character is. The internal representation of a Java String is UTF-16, so each char ranges from 0x0000 to 0xFFFF. If you define full-width characters using the Unicode definition, you can just check whether the char is within the range of the Unicode block of full-width characters. But that block does not include some characters common in Chinese text, such as ‵ and 。.
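
A minimal sketch of that block check (my own illustration, which inherits the limitation just mentioned):

// True only for chars in the "Halfwidth and Fullwidth Forms" block (U+FF00-U+FFEF);
// common ideographs such as '思' live in other blocks and are not matched.
static boolean inFullwidthFormsBlock(char c) {
    return Character.UnicodeBlock.of(c)
        == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS;
}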

code4j