Kotlin/Java - How to identify full width characters?

Question

TL;DR:

Half-width: regular-width characters.
E.g. 'A' and 'ﾆ'

Full-width: characters that occupy the space of two monospaced English characters on the display.
E.g. '中', 'に' and 'Ａ'

I need an implementation of this function:

/**
 * @return true if this character is full-width, false otherwise.
 */
fun Char.isFullWidth(): Boolean
{
    // What is the most efficient implementation here?
}

No, this is not about data structures for those characters; it is only about their displayed width.

Long Story:

I'm refactoring HyLogger, a logging library focused on text-coloring with gradients. Here is the problem I ran into:

In the first gradient text block printed in the screenshot, the full-width text in the middle throws off the gradient pattern after it: string.length counts each full-width character as one character even though it takes up twice the width.

You might be asking, why on earth would anyone print full-width characters? This is a real problem because almost all characters in languages like Chinese, Japanese, or Korean are full-width and therefore take twice the space, just like the full-width English characters.

So I need a way to identify full-width characters so that I can count each of them as two gradient pixels instead of one, which would solve the problem in the picture.

Known Info:

C++ check if unicode character is full width:

There is a list of East Asian Width characters on the Unicode website (and also the report), but traversing the entire list for every single character is probably not efficient when rendering a gradient text block.
Python has a Unicode database library; one possible solution is to call the Python API through Jython, but that would be heavy and probably not very efficient.

Analyzing full width or half width character in Java:

The ICU4J library has Unicode tools for this, but it is 12.5 MB, which isn't optimal for my 50 KB logger library.
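The efficiency concern above mostly applies to a linear scan of the list. If the wide ranges are stored as a sorted array, each lookup can use a binary search instead. The following is only a minimal sketch: `WIDE_RANGES` and `isWide` are hypothetical names, and the table is deliberately incomplete (a handful of well-known wide blocks), where a real table would be generated from EastAsianWidth.txt.

```kotlin
// Hypothetical, deliberately incomplete table of (first, last) code point pairs.
// Pairs must be sorted by their first value and must not overlap.
val WIDE_RANGES: IntArray = intArrayOf(
    0x1100, 0x115F, // Hangul Jamo initial consonants
    0x3041, 0x3096, // Hiragana letters
    0x30A1, 0x30FA, // Katakana letters
    0x4E00, 0x9FFF, // CJK Unified Ideographs
    0xAC00, 0xD7A3, // Hangul syllables
    0xFF01, 0xFF60  // Fullwidth ASCII variants
)

/** Returns true if [codePoint] falls inside one of the ranges in [WIDE_RANGES]. */
fun isWide(codePoint: Int): Boolean {
    var lo = 0
    var hi = WIDE_RANGES.size / 2 - 1
    while (lo <= hi) {
        val mid = (lo + hi) ushr 1
        val first = WIDE_RANGES[2 * mid]
        val last = WIDE_RANGES[2 * mid + 1]
        when {
            codePoint < first -> hi = mid - 1 // look in the lower half
            codePoint > last -> lo = mid + 1  // look in the upper half
            else -> return true               // inside this range
        }
    }
    return false
}
```

With n ranges this needs roughly log2(n) comparisons per character and no allocation, so the table size has little effect on rendering cost.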

*Display* width depends on the font, so select a monospaced font and calculate the width of each character. See e.g. [Calculate the display width of a string in Java](https://stackoverflow.com/q/258486/5221149). — Andreas, Jul 09 '20 at 00:09
@Andreas Thanks for the input. But there are standards for "half-width" and "full-width" defined by Unicode, and with a monospaced font, a half-width char would have half the width of the full-width char. And using `Graphics` to determine individual char width doesn't seem efficient. — Hykilpikonna, Jul 09 '20 at 00:13
If you want to query Unicode properties not available in Java, why not use the library mentioned in the [second answer](https://stackoverflow.com/a/35665683/5221149) to the last link you provided? — Andreas, Jul 09 '20 at 00:19
@Andreas Sorry that I didn't see it. That would be the best solution so far, but the entire library is 12.5 MB large, which isn't optimal for my 50 KB logger library. Thanks anyway! I'll try to manually import a small portion of the library for this one function. — Hykilpikonna, Jul 09 '20 at 00:34

Hykilpikonna · Accepted Answer · 2020-07-11T01:50:24.113

The best solution seems to be converting EastAsianWidth.txt to a series of range conditions.

The function below was partially generated with FullWidthUtilGenerator.kt and still has some unresolved issues:

It does not account for characters outside the Basic Multilingual Plane (BMP), e.g. U+10000, because I haven't figured out how to include them effectively in Java/Kotlin ('\u10000' gives a compilation error).
Adjacent values that are listed separately in EastAsianWidth.txt are not automatically merged yet (e.g. \u3010 and \u3011).
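Both issues could be handled at generation time by parsing code points as Int values and merging neighbouring ranges before emitting any conditions. The sketch below is not FullWidthUtilGenerator.kt; `parseWideRanges` is a hypothetical helper, and it assumes the usual EastAsianWidth.txt line layout of `code;width # comment` or `first..last;width # comment`, keeping only the W and F classes.

```kotlin
/**
 * Parses lines in the EastAsianWidth.txt layout and returns the wide (W) and
 * fullwidth (F) ranges, sorted and with touching or overlapping ranges merged.
 */
fun parseWideRanges(lines: Sequence<String>): List<IntRange> {
    val ranges = lines
        .map { it.substringBefore('#').trim() } // drop trailing comments
        .filter { it.isNotEmpty() }
        .mapNotNull { line ->
            val fields = line.split(';')
            if (fields.size < 2) return@mapNotNull null
            val width = fields[1].trim()
            if (width != "W" && width != "F") return@mapNotNull null
            // "3010" yields one bound, "1100..115F" yields two.
            val bounds = fields[0].trim().split("..")
            bounds.first().toInt(16)..bounds.last().toInt(16)
        }
        .sortedBy { it.first }
        .toList()

    val merged = mutableListOf<IntRange>()
    for (range in ranges) {
        val previous = merged.lastOrNull()
        if (previous != null && range.first <= previous.last + 1) {
            // Extend the previous range instead of adding a new entry.
            merged[merged.size - 1] = previous.first..maxOf(previous.last, range.last)
        } else {
            merged.add(range)
        }
    }
    return merged
}
```

Fed the two separate entries for U+3010 and U+3011, this returns the single range 0x3010..0x3011. Because the bounds are Int values, entries above U+FFFF are kept as well, although matching them requires a lookup that works on code points, not on Char.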

/**
 * Half-width: regular-width characters.
 * E.g. 'A' and 'ﾆ'
 *
 * Full-width: characters that occupy the space of two monospaced English characters on the display.
 * E.g. '中', 'に' and 'Ａ'
 *
 * Only covers the Basic Multilingual Plane, since a Char is a single UTF-16 code unit.
 * See FullWidthUtilGenerator.kt
 *
 * @return true if this character is full-width, false otherwise.
 */
fun Char.isFullWidth(): Boolean
{
    return when (this)
    {
        '\u2329','\u232A','\u23F0','\u23F3','\u267F','\u2693','\u26A1','\u26CE','\u26D4','\u26EA','\u26F5',
        '\u26FA','\u26FD','\u2705','\u2728','\u274C','\u274E','\u2757','\u27B0','\u27BF','\u2B50','\u2B55',
        '\u3000','\u3004','\u3005','\u3006','\u3007','\u3008','\u3009','\u300A','\u300B','\u300C','\u300D',
        '\u300E','\u300F','\u3010','\u3011','\u3014','\u3015','\u3016','\u3017','\u3018','\u3019','\u301A',
        '\u301B','\u301C','\u301D','\u3020','\u3030','\u303B','\u303C','\u303D','\u303E','\u309F','\u30A0',
        '\u30FB','\u30FF','\u3250','\uA015','\uFE17','\uFE18','\uFE19','\uFE30','\uFE35','\uFE36','\uFE37',
        '\uFE38','\uFE39','\uFE3A','\uFE3B','\uFE3C','\uFE3D','\uFE3E','\uFE3F','\uFE40','\uFE41','\uFE42',
        '\uFE43','\uFE44','\uFE47','\uFE48','\uFE58','\uFE59','\uFE5A','\uFE5B','\uFE5C','\uFE5D','\uFE5E',
        '\uFE62','\uFE63','\uFE68','\uFE69','\uFF04','\uFF08','\uFF09','\uFF0A','\uFF0B','\uFF0C','\uFF0D',
        '\uFF3B','\uFF3C','\uFF3D','\uFF3E','\uFF3F','\uFF40','\uFF5B','\uFF5C','\uFF5D','\uFF5E','\uFF5F',
        '\uFF60','\uFFE2','\uFFE3','\uFFE4',
        in '\u1100'..'\u115F',in '\u231A'..'\u231B',in '\u23E9'..'\u23EC',in '\u25FD'..'\u25FE',
        in '\u2614'..'\u2615',in '\u2648'..'\u2653',in '\u26AA'..'\u26AB',in '\u26BD'..'\u26BE',
        in '\u26C4'..'\u26C5',in '\u26F2'..'\u26F3',in '\u270A'..'\u270B',in '\u2753'..'\u2755',
        in '\u2795'..'\u2797',in '\u2B1B'..'\u2B1C',in '\u2E80'..'\u2E99',in '\u2E9B'..'\u2EF3',
        in '\u2F00'..'\u2FD5',in '\u2FF0'..'\u2FFB',in '\u3001'..'\u3003',in '\u3012'..'\u3013',
        in '\u301E'..'\u301F',in '\u3021'..'\u3029',in '\u302A'..'\u302D',in '\u302E'..'\u302F',
        in '\u3031'..'\u3035',in '\u3036'..'\u3037',in '\u3038'..'\u303A',in '\u3041'..'\u3096',
        in '\u3099'..'\u309A',in '\u309B'..'\u309C',in '\u309D'..'\u309E',in '\u30A1'..'\u30FA',
        in '\u30FC'..'\u30FE',in '\u3105'..'\u312F',in '\u3131'..'\u318E',in '\u3190'..'\u3191',
        in '\u3192'..'\u3195',in '\u3196'..'\u319F',in '\u31A0'..'\u31BF',in '\u31C0'..'\u31E3',
        in '\u31F0'..'\u31FF',in '\u3200'..'\u321E',in '\u3220'..'\u3229',in '\u322A'..'\u3247',
        in '\u3251'..'\u325F',in '\u3260'..'\u327F',in '\u3280'..'\u3289',in '\u328A'..'\u32B0',
        in '\u32B1'..'\u32BF',in '\u32C0'..'\u32FF',in '\u3300'..'\u33FF',in '\u3400'..'\u4DBF',
        in '\u4E00'..'\u9FFC',in '\u9FFD'..'\u9FFF',in '\uA000'..'\uA014',in '\uA016'..'\uA48C',
        in '\uA490'..'\uA4C6',in '\uA960'..'\uA97C',in '\uAC00'..'\uD7A3',in '\uF900'..'\uFA6D',
        in '\uFA6E'..'\uFA6F',in '\uFA70'..'\uFAD9',in '\uFADA'..'\uFAFF',in '\uFE10'..'\uFE16',
        in '\uFE31'..'\uFE32',in '\uFE33'..'\uFE34',in '\uFE45'..'\uFE46',in '\uFE49'..'\uFE4C',
        in '\uFE4D'..'\uFE4F',in '\uFE50'..'\uFE52',in '\uFE54'..'\uFE57',in '\uFE5F'..'\uFE61',
        in '\uFE64'..'\uFE66',in '\uFE6A'..'\uFE6B',in '\uFF01'..'\uFF03',in '\uFF05'..'\uFF07',
        in '\uFF0E'..'\uFF0F',in '\uFF10'..'\uFF19',in '\uFF1A'..'\uFF1B',in '\uFF1C'..'\uFF1E',
        in '\uFF1F'..'\uFF20',in '\uFF21'..'\uFF3A',in '\uFF41'..'\uFF5A',in '\uFFE0'..'\uFFE1',
        in '\uFFE5'..'\uFFE6' -> true
        else -> false
    }
}
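On the non-BMP limitation: a Kotlin/Java Char is a single 16-bit UTF-16 code unit and a `\u` escape takes exactly four hex digits, so a character such as U+10000 is stored in a String as a surrogate pair and cannot be matched by a Char-based function. One way around this is to walk the string by code point. The sketch below is JVM-only and makes assumptions: `displayWidth` is a hypothetical helper, the width predicate is passed in by the caller, and every code point that is not wide is counted as one cell (so combining marks and other zero-width characters are not handled).

```kotlin
/**
 * Returns the number of monospaced cells [text] occupies, counting each code
 * point accepted by [isWide] as two cells and every other code point as one.
 */
fun displayWidth(text: String, isWide: (Int) -> Boolean): Int {
    var width = 0
    var index = 0
    while (index < text.length) {
        // Combines a surrogate pair into a single code point where present.
        val codePoint = text.codePointAt(index)
        width += if (isWide(codePoint)) 2 else 1
        // Advance by 2 chars for supplementary code points, otherwise by 1.
        index += Character.charCount(codePoint)
    }
    return width
}
```

For the gradient problem in the question, this value would replace string.length when deciding how many gradient pixels a piece of text spans. A BMP-only predicate can be adapted with something like `{ cp -> cp <= 0xFFFF && cp.toChar().isFullWidth() }`.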
