11

Where can I find a Unicode table showing only the simplified Chinese characters? I have searched everywhere but cannot find anything.

UPDATE:
I have found that there is another encoding called GB 2312 (http://en.wikipedia.org/wiki/GB_2312), which contains only simplified characters.
Surely I can use this to get what I need?

I have also found this file, which maps GB2312 to Unicode (http://cpansearch.perl.org/src/GUS/Unicode-UTF8simple-1.06/gb2312.txt), but I'm not sure whether it's accurate.

If that table isn't correct, maybe someone could point me to one that is, or just to a table of the GB2312 characters and some way to convert them to Unicode?

UPDATE 2:
This site also provides a GB/Unicode table, and even a Java program that generates a file with all the GB characters and their Unicode equivalents:
http://www.herongyang.com/gb2312/
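
A comparable table can probably be produced without any external file. The following is a minimal sketch, assuming Python 3 and its built-in `gb2312` codec; the function name `gb2312_hanzi` is made up for illustration. It walks rows 16 to 87 of GB 2312 (lead bytes 0xB0 to 0xF7 in the EUC-CN byte form), where the Chinese characters live, and decodes each two-byte code to its Unicode character. Note that the result is the set of characters GB 2312 supports for simplified Chinese text, most of which are also used unchanged in traditional text.

```python
def gb2312_hanzi():
    """Yield (gb_code, character) for every Chinese character in GB 2312."""
    # Hanzi occupy rows 16-87; in EUC-CN the lead byte is row + 0xA0.
    for hi in range(0xB0, 0xF8):
        for lo in range(0xA1, 0xFF):
            try:
                ch = bytes([hi, lo]).decode("gb2312")
            except UnicodeDecodeError:
                continue  # unassigned cell in the table
            yield (hi << 8) | lo, ch


table = list(gb2312_hanzi())
print(len(table), "characters")
for gb, ch in table[:3]:
    # GB code (EUC-CN bytes), Unicode code point, the character itself
    print(f"{gb:04X}  U+{ord(ch):04X}  {ch}")
```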

Makoto
cmann

6 Answers

17

The Unihan database contains this information in the file Unihan_Variants.txt. For example, one traditional/simplified pair is recorded as:

U+673A  kTraditionalVariant     U+6A5F
U+6A5F  kSimplifiedVariant      U+673A

In the above case, U+6A5F is 機, the traditional form of 机 (U+673A).
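
A minimal sketch of reading those records, assuming Python 3 and whitespace-separated fields as in the excerpt above. The function name `variant_sets` is hypothetical, and the in-memory sample stands in for the real file. A character that has a `kTraditionalVariant` entry is a simplified form of something; a character that has a `kSimplifiedVariant` entry is a traditional form. Some characters may carry both fields, so subtracting one set from the other is one way to keep only the exclusively simplified ones.

```python
import re

CODEPOINT = re.compile(r"U\+([0-9A-F]{4,6})")


def variant_sets(lines):
    """Return (simplified, traditional) sets of characters from Unihan_Variants lines."""
    simplified, traditional = set(), set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        source, field, _targets = parts
        match = CODEPOINT.fullmatch(source)
        if match is None:
            continue
        ch = chr(int(match.group(1), 16))
        if field == "kTraditionalVariant":
            simplified.add(ch)   # it has a traditional counterpart
        elif field == "kSimplifiedVariant":
            traditional.add(ch)  # it has a simplified counterpart
    return simplified, traditional


# Sample lines copied from the excerpt; with the real file, pass an open file object instead.
sample = [
    "U+673A\tkTraditionalVariant\tU+6A5F",
    "U+6A5F\tkSimplifiedVariant\tU+673A",
]
simplified, traditional = variant_sets(sample)
print(sorted(simplified - traditional))
```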

Another approach is to use the CC-CEDICT project, which publishes a dictionary of Chinese characters and compounds (both traditional and simplified). Each entry looks something like:

宕機 宕机 [dang4 ji1] /to crash (of a computer)/Taiwanese term for 當機|当机[dang4 ji1]/

The first column is traditional characters, and the second column is simplified.

To get all the simplified characters, read this text file and make a list of every character that appears in the second column. Note that some characters may not appear by themselves (only in compounds), so it is not sufficient to look at single-character entries.
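
The collection step could look roughly like this. It is a sketch, assuming Python 3 and the space-separated "traditional simplified [pinyin] /definitions/" layout shown above; `simplified_column_chars` is a made-up name. Because the second column can contain Latin letters or punctuation in some entries, the sketch keeps only characters whose Unicode name marks them as CJK unified ideographs.

```python
import unicodedata


def simplified_column_chars(lines):
    """Collect every Han character that appears in the simplified column."""
    chars = set()
    for line in lines:
        if line.startswith("#"):
            continue  # header comments in the dictionary file
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue
        for ch in parts[1]:  # second column: simplified form
            if unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH"):
                chars.add(ch)
    return chars


# The entry quoted above; with the real dictionary, pass an open file object instead.
sample = [
    "宕機 宕机 [dang4 ji1] /to crash (of a computer)/Taiwanese term for 當機|当机[dang4 ji1]/",
]
print(sorted(simplified_column_chars(sample)))
```

The result is every character used in simplified-Chinese text, including characters such as 宕 that are written the same way in traditional text.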

Greg Hewgill
  • Where exactly can I find Unihan_Variants.txt? – cmann Jan 04 '11 at 19:54
  • So if I were to use the Unihan_Variants.txt file, I would simply find each line with kTraditionalVariant and use the code at the beginning of the line and this should give me all the simplified unicode characters? – cmann Jan 04 '11 at 21:42
  • 1
    @cmann: The latest Unihan database is here: [`Unihan.zip`](http://www.unicode.org/Public/UNIDATA/Unihan.zip). Note that only some characters have both traditional and simplified variants, therefore not all characters even *have* an entry in `Unihan_Variants.txt`. I suppose it depends on whether you want "all the characters used in Simplified Chinese", or "only the simplified characters where they are different from traditional". – Greg Hewgill Jan 05 '11 at 00:42
  • I suppose it's probably unnecessary to have ALL the characters, I'm sure just the most common ones should be enough? Maybe something along the lines of the characters taught at Chinese schools? – cmann Jan 05 '11 at 07:04
  • 1
    @cmann: In *that* case, have a look at the [Hanyu Shuiping Kaoshi](http://en.wikipedia.org/wiki/Hanyu_Shuiping_Kaoshi) word lists. These are official proficiency tests for Chinese within the PRC. – Greg Hewgill Jan 05 '11 at 09:13
  • But that seems to cover both simplified and traditional characters. – cmann Jan 05 '11 at 10:15
  • I guess I just assumed because there was no mention that it was _only_ simplified characters. – cmann Jan 06 '11 at 08:00
  • @cmann: The vast majority of Chinese characters have only *one* form, which is used in both "traditional" and "simplified" contexts. For a few characters, the PRC government made "simplified" characters to help improve literacy rates. You can't write very much using *only* simplified characters, because simplified characters are a very small part of the whole character set. As you can see in my example above, 宕 appears in the same form in both traditional and simplified contexts. – Greg Hewgill Jan 06 '11 at 10:47
  • Thanks for the info you've been a lot of help; I think I have all the information I need – cmann Jan 06 '11 at 21:03
10

The OP doesn't indicate which language they're using, but if you're using Ruby, I've written a small library that can distinguish between simplified and traditional Chinese (plus Korean and Japanese as a bonus). As suggested in Greg's answer, it relies on a distilled version of Unihan_Variants.txt to figure out which chars are exclusively simplified and which are exclusively traditional.

https://github.com/jpatokal/script_detector

Sample:

p string
=> "我的氣墊船充滿了鱔魚."
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.simplified_chinese?
=> false

But as the Unicode FAQ duly warns, this requires sizable fragments of text to work reliably, and will give misleading results for short strings. Consider the Japanese for Tokyo:

p string
=> "東京"
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.japanese?
=> false

Since both characters happen to also be valid traditional Chinese, and there are no exclusively Japanese characters, it's not recognized correctly.
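
The underlying idea can be illustrated in a few lines. This is not the library's implementation, only a sketch in Python with two tiny hand-picked character sets (a real tool would derive them from Unihan_Variants.txt); the names `TRADITIONAL_ONLY`, `SIMPLIFIED_ONLY` and `classify` are made up for the example.

```python
# Hand-picked illustration only; not a complete or authoritative list.
TRADITIONAL_ONLY = set("氣墊滿鱔魚東")
SIMPLIFIED_ONLY = set("气垫满鳝鱼东")


def classify(text):
    """Guess the script from characters that exist in only one of the two sets."""
    has_trad = any(ch in TRADITIONAL_ONLY for ch in text)
    has_simp = any(ch in SIMPLIFIED_ONLY for ch in text)
    if has_trad and has_simp:
        return "mixed"
    if has_trad:
        return "traditional"
    if has_simp:
        return "simplified"
    return "undetermined"  # only shared characters, nothing to go on


for s in ("我的氣墊船充滿了鱔魚.", "我的气垫船充满了鳝鱼.", "東京"):
    print(s, classify(s))
```

With this approach 東京 comes out as traditional Chinese for the same reason as above: 東 is in the traditional-only set and nothing in the string rules Chinese out.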

lambshaanxy
  • 1
    This is great work! The codepoint list file (https://github.com/jpatokal/script_detector/blob/master/lib/chinese_detector.rb) is wonderful. I wonder why this answer has received only a few upvotes... – cxwangyi Nov 18 '14 at 23:41
1

I'm not sure if that's easily done. The Han ideographs are unified in Unicode, so it's not immediately obvious how to do it. But the Unihan database (http://www.unicode.org/charts/unihan.html) might have the data you need.

arnsholt
1

Here is a regex I made that matches all simplified Chinese characters. For some reason Stack Overflow complains when I paste it here, so it's linked in a pastebin below.

https://pastebin.com/xw4p7RVJ

You'll notice that this list uses ranges rather than each individual character, and that these are literal UTF-8 characters, not escaped representations. It's served me well in one iteration or another since around 2010. Hopefully everyone else can make some use of it now.

If you want only the characters that are not simplified (I can't imagine why, it's not come up once in 9 years), iterate over all the chars in ['一-龥'] and build a new list from those the regex does not match. Or run two regexes: one to check that the text is Chinese, and one to check that it is not simplified Chinese.

MrMesees
  • The pastebin link is gone, can you re-post a link? – Mike Maxwell Feb 08 '21 at 18:46
  • I actually don't know where it is if they've deleted it. I did mark it to live forever. Some pleb on the internet has likely griefed with a copyright claim. If I run into it on my travels I'll re-post. I have it somewhere, but where might be another question. – MrMesees Feb 09 '21 at 07:15
0

According to Wikipedia, the choice between simplified Chinese, traditional Chinese, kanji, and other forms is in many cases left to font rendering. So while you could assemble a selection of simplified Chinese codepoints, the list would not be at all complete, since many characters share a single codepoint across these forms and are not encoded separately.

Michael Lowman
  • Surely this is not impossible? Inside the Flash IDE for example, you can select Chinese Traditional, Chinese Simplified or Chinese All. How do they do it? – cmann Jan 04 '11 at 20:05
  • Well, a font choice would cover the glyph choice. So when a particular codepoint is available in multiple styles, a simplified Chinese font would show the simplified Chinese glyph. – Michael Lowman Jan 04 '11 at 20:29
  • Greg's answer is completely accurate; the page linked is the main page. It has a web interface to the database but the backing files are linked on the page: "For access to the most recent version of the raw data files (Unihan.zip), see http://www.unicode.org/Public/UNIDATA/." – Michael Lowman Jan 04 '11 at 20:31
0

I don't believe that there's a table with only simplified code points. I think they're all lumped together in the CJK range of 0x4E00 through 0x9FFF.
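
A quick way to see this, sketched in Python (the helper name `in_cjk_unified_block` is made up for the example): both the simplified 机 (U+673A) and the traditional 機 (U+6A5F) fall inside the same range, so a range check alone cannot tell them apart. Rarer characters also live in extension blocks outside this range, which the sketch does not cover.

```python
def in_cjk_unified_block(ch):
    """True if ch lies in the range 0x4E00 through 0x9FFF mentioned above."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


for ch in "机機宕A":
    print(f"U+{ord(ch):04X} {ch} {in_cjk_unified_block(ch)}")
```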

Chris Haas
  • 1
    I second this after reading through this range. I am a native Chinese speaker who knows both simplified and traditional Chinese. – cxwangyi Nov 18 '14 at 23:35