0

Recently I've been dealing with texts with mixed languages, including Chinese, English, and even some emoticons.

I've searched quite a lot, but everything I find explains how "to replace full-width characters with half-width characters" rather than how to determine whether a given character is half-width or full-width.

So, my question is:

Is it possible to tell whether a word is half-width or full-width?

deadly
amigcamel
  • Usually, if you want to examine text, you will put it into a compatibility decomposition form, such as NFKD. If you do this, fullwidth latin characters become normal (halfwidth), and halfwidth kana/hangul become normal (fullwidth), making it easier to analyze the text. You can do this in python with `import unicodedata; unicodedata.normalize('NKFD', text)`. – Dietrich Epp Jun 09 '12 at 08:45
  • It's NFKD, not NKFD. – martin-k May 07 '15 at 12:28
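The normalization suggested in the comments above could look roughly like this. It is a minimal sketch; the helper name `to_compat` is made up for illustration, while `unicodedata.normalize` is the standard-library call mentioned in the comment (with the corrected form name `NFKD`).

```python
import unicodedata


def to_compat(text):
    """Return text in compatibility decomposition form (NFKD).

    Fullwidth Latin letters and digits map to their ordinary ASCII
    counterparts, and halfwidth katakana map to ordinary katakana.
    """
    return unicodedata.normalize("NFKD", text)
```

Note that NFKD also decomposes other compatibility characters (ligatures, accented letters, and so on), so it changes more than just width.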

3 Answers

2

Unicode 6.1 defines the block Halfwidth and Fullwidth Forms (PDF here).

Within this block, \uFF01-\uFF60 and \uFFE0-\uFFE6 are fullwidth, while \uFF61-\uFFDC and \uFFE8-\uFFEE are halfwidth.
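As a rough sketch, the ranges above can be checked directly in Python. The function name `classify_form` and the two constants are invented for this example, and the check only covers this one block, so every character outside it comes back as `None`.

```python
# Code point ranges taken from the answer text (inclusive on both ends).
FULLWIDTH_RANGES = ((0xFF01, 0xFF60), (0xFFE0, 0xFFE6))
HALFWIDTH_RANGES = ((0xFF61, 0xFFDC), (0xFFE8, 0xFFEE))


def classify_form(ch):
    """Classify one character from the Halfwidth and Fullwidth Forms block.

    Returns "fullwidth", "halfwidth", or None when the character lies
    outside the listed ranges. Unassigned code points inside a range are
    not filtered out.
    """
    cp = ord(ch)
    if any(lo <= cp <= hi for lo, hi in FULLWIDTH_RANGES):
        return "fullwidth"
    if any(lo <= cp <= hi for lo, hi in HALFWIDTH_RANGES):
        return "halfwidth"
    return None
```

For example, `classify_form("\uff21")` (FULLWIDTH LATIN CAPITAL LETTER A) gives `"fullwidth"`, while an ordinary `"A"` or a CJK ideograph gives `None`.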

beerbajay
  • \u00F01-\uFF60 is wrong, it should be \uFF01-\uFF60 – Lelouchcr Nov 07 '15 at 09:37
  • @Lelouchcr Fixed! Thanks for the correction 3 years after the answer! – beerbajay Nov 07 '15 at 10:13
  • This isn't really a complete answer. Characters outside of the Halfwidth and Fullwidth forms can be either halfwidth or fullwidth, so using only that blocks chart as a reference leaves the vast majority of Unicode undefined. – Laurence Gonsalves Jan 10 '17 at 01:19
  • @LaurenceGonsalves There is no standard for width of characters defined; this existing width designation (half/full) only makes sense in certain contexts. See also [this answer](http://stackoverflow.com/a/9145712/320220) about `wcwidth`, which specifies full-width as 2 columns and every other normal character as 1 column wide. – beerbajay Jan 10 '17 at 01:46
1

I think this is a hard question to answer unless you have clear criteria of what is a half-width character and what is a full-width character. If you can decide on that, then you test the characters in the word against certain ranges in Unicode (or any encoding scheme).

The Unicode block Halfwidth and Fullwidth Forms only shows you which characters have alternate forms. For any that do not feature in this block, you have to decide what you consider half- and full-width.

I would imagine that most Western characters are half-width, and most Eastern characters are full-width, but there will be exceptions in both. As this Unicode report highlights, there are also ambiguities.

This proposal includes code that seems to divide characters into full-, half-, and ambiguous-width. You could use those code points as a starting point.
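One possible starting point in Python is the East_Asian_Width property exposed by the standard library's `unicodedata.east_asian_width`. This is an assumption on my part: it may not be the same data as the report or proposal linked above, and the grouping of property values into "full", "half", "ambiguous", and "neutral" is just one choice of criteria. The name `width_class` is made up for the example.

```python
import unicodedata


def width_class(ch):
    """Map one character's East_Asian_Width value to a coarse category.

    F (Fullwidth) and W (Wide)    -> "full"
    H (Halfwidth) and Na (Narrow) -> "half"
    A (Ambiguous)                 -> "ambiguous"
    N (Neutral)                   -> "neutral"
    """
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("F", "W"):
        return "full"
    if eaw in ("H", "Na"):
        return "half"
    if eaw == "A":
        return "ambiguous"
    return "neutral"
```

How to treat the "ambiguous" and "neutral" results is exactly the decision described above; in an East Asian context ambiguous characters are often rendered wide, elsewhere narrow.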

deadly
0

A word is full-width if all of its characters are full-width. You need to look up the Unicode specification to see which character ranges are full-width, then check each character against them.
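A minimal sketch of that per-character check, assuming "full-width" means an East_Asian_Width value of F or W as reported by Python's `unicodedata.east_asian_width`; you could substitute explicit code point ranges instead. The function name `is_fullwidth_word` is hypothetical.

```python
import unicodedata


def is_fullwidth_word(word):
    """Return True if word is non-empty and every character is full-width.

    "Full-width" is taken here to mean East_Asian_Width F or W, which is
    an assumption, not a definition given in the answer.
    """
    if not word:
        return False
    return all(unicodedata.east_asian_width(ch) in ("F", "W") for ch in word)
```

A word that mixes widths, such as a CJK character followed by an ASCII letter, is reported as not full-width under this rule.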

Amadan