
I need to create packages that each contain only the Unicode characters used by one specified language. A key requirement is that the packages be as small as possible, which is why each one contains only the characters used for its language.

The problem is that I can't find a single resource online that specifies the ranges for ONLY a certain language, such as ranges X1-X2, Y3-Y8, etc. for Simplified Chinese. Instead, every source tells me to use CJK (U+4E00 - U+9FFF). I'd like to know which parts of CJK are used for each of the languages below.

I understand that many characters in Asian languages are considered obsolete or unused, so they should be excluded from the ranges. The ranges should include only characters that are actually used to communicate. I hope that's clear, haha.

That being said, the languages I'm trying to make these packages for are:

  • Simplified Chinese
  • Traditional Chinese
  • Korean
  • Japanese

Does anyone know the exclusive ranges for these languages or how to find them out?

Rick
  • [1] As it stands I think your question is off topic for SO; it is a question solely about Unicode. Can you modify it to make it related to programming/software development? [2] That said, the question [Unicode range for Japanese](https://stackoverflow.com/q/19899554/2985643) may be helpful. [3] This [Unicode chart](https://www.unicode.org/charts/) might be as well. [4] For the languages you are interested in there is no clean tidy one-to-one mapping between the language and its code block. For example, Japanese Hiragana characters are in a block outside of the CJK range. – skomisa May 05 '22 at 01:08
  • [This SO answer](https://stackoverflow.com/a/2352742/2985643) addresses your concern for Japanese somewhat, but it also makes it clear that the definition of what characters are actually in the Japanese character set is flexible. If you don't know Chinese/Japanese/Korean, you may need the assistance of native speakers to be sure that you are making the right decisions on what to include/omit. – skomisa May 05 '22 at 01:25
  • See also: [1] [What's the complete range for Chinese characters in Unicode?](https://stackoverflow.com/questions/1366068/whats-the-complete-range-for-chinese-characters-in-unicode) [2] A Unicode FAQ on [Chinese and Japanese](https://www.unicode.org/faq/han_cjk.html). Note especially the answer to _"How can I recognize from the 32 bit value of a Unicode character if this is a Chinese, Korean or Japanese character?"_, which is: _"**It's basically impossible and largely meaningless**. It's the equivalent of asking if "a" is an English letter or a French one..."_. (Emphasis mine) – skomisa May 05 '22 at 02:20
  • Your task is very specific, and it may involve very specific trade-offs, so you should build such a table yourself. Common methods: take a few pages from government sites, news agencies, and Wikipedia, then sort the characters by how often they are used. Bonus: you also get the more frequent special characters (and punctuation). It will then be your task to define the trade-off point. Note: depending on the situation, it is common to mix languages (e.g. in people's names and place names). And if you want to reduce space, just use system fonts (so you have all characters, and no complex code) – Giacomo Catenazzi May 05 '22 at 06:59
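
The corpus-counting approach suggested in the last comment could be sketched roughly as follows. This is only a minimal illustration in Python, not a definitive tool: the function names (`char_frequencies`, `select_by_coverage`, `to_ranges`) and the coverage threshold are assumptions made for the example, and the sample texts would have to come from whatever sources you consider representative of each language.

```python
from collections import Counter
import unicodedata


def char_frequencies(texts):
    """Count how often each character occurs across the sample texts."""
    counts = Counter()
    for text in texts:
        for ch in text:
            # Skip whitespace and "Other" categories (control, format, unassigned).
            if ch.isspace() or unicodedata.category(ch).startswith("C"):
                continue
            counts[ch] += 1
    return counts


def select_by_coverage(counts, coverage=0.999):
    """Keep the most frequent characters until they cover the requested
    share of all character occurrences (the trade-off point is an assumption)."""
    total = sum(counts.values())
    selected = []
    running = 0
    for ch, n in counts.most_common():
        if total and running / total >= coverage:
            break
        selected.append(ch)
        running += n
    return selected


def to_ranges(chars):
    """Collapse a set of characters into sorted, inclusive code point ranges."""
    ranges = []
    for cp in sorted({ord(c) for c in chars}):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp  # extend the current run
        else:
            ranges.append([cp, cp])  # start a new run
    return [(start, end) for start, end in ranges]
```

A possible usage would be to feed one corpus per language into `char_frequencies`, pick a threshold with `select_by_coverage`, and print the result of `to_ranges` as `U+XXXX-U+YYYY` pairs. The ranges will likely be fragmented, which matches the comments above: there is no tidy one-to-one mapping between these languages and code blocks.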

0 Answers