I've found places on the web such as http://www.chinesetopinyin.com that convert Chinese characters to pinyin (romanization).

Does anyone know how to do this, or have a database that can be parsed?


EDIT: I'm using C# but would actually prefer a database/flatfile.

Red
Mass

1 Answer

A possible solution using Python:

The Unicode database (Unihan) contains pinyin romanizations for Chinese characters, but they are not included in the data exposed by Python's unicodedata module.

However, you can use an external library such as cjklib. Example:

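# coding: UTF-8
from cjklib.characterlookup import CharacterLookup

c = u'好'

# 'T' selects the traditional Chinese locale for the lookup
cjk = CharacterLookup('T')
# returns every pinyin reading known for the character
readings = cjk.getReadingForCharacter(c, 'Pinyin')
for r in readings:
    print(r)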

Output:

hāo
hǎo
hào

UPDATE

cjklib comes with a standalone cjknife utility, which might help. Some usage is described here
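Since the question asks for a database or flat file, the Unihan readings file can also be parsed directly. What follows is only a minimal sketch, assuming Python 3 and assuming the tab-separated layout used by Unihan_Readings.txt (code point, field name, value, with `#` comment lines). The `SAMPLE` text and the `load_pinyin` helper are hypothetical names made up for illustration, and the sample lines are written by hand in that layout rather than copied from the real file.

```python
import io

# Hand-written sample in the assumed Unihan_Readings.txt layout:
# code point <TAB> field name <TAB> value
SAMPLE = (
    "# illustrative lines only, not copied from the real file\n"
    "U+4E2D\tkMandarin\tzhōng\n"
    "U+597D\tkMandarin\thǎo\n"
    "U+597D\tkCantonese\thou2 hou3\n"
)


def load_pinyin(lines):
    """Build a {character: [pinyin, ...]} table from kMandarin entries."""
    table = {}
    for line in lines:
        line = line.strip()
        # skip blank lines and comments
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        codepoint, field, value = parts
        if field != "kMandarin":
            continue
        # "U+597D" -> the character with code point 0x597D
        char = chr(int(codepoint[2:], 16))
        table[char] = value.split()
    return table


table = load_pinyin(io.StringIO(SAMPLE))
print(table["好"])
```

To run it against the real data, pass an open file instead of the `io.StringIO` sample, for example `open("Unihan_Readings.txt", encoding="utf-8")`. The same line-by-line logic ports readily to C#.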

mykhal
  • ... and if you want an ASCII-only or numeric representation, the documentation may explain how to get it, or you can pick the first pinyin and remove the accents: http://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string – mykhal Aug 26 '10 at 02:48
  • Unicode does have a table for Character to Pinyin mapping, it's called Unihan and has loads of data. :) – cburgmer May 20 '12 at 20:53
  • `raise ValueError, 'unknown locale: %s' % localename ValueError: unknown locale: UTF-8` any idea ? – jokoon Sep 16 '12 at 22:53
  • jokoon: i don't know.. what are you getting from `import locale; locale.getlocale()`? – mykhal Sep 16 '12 at 23:35
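
The accent-removal idea from the first comment can be sketched with the standard library alone. This is a rough illustration assuming Python 3; `strip_tone_marks` is a hypothetical helper name, not part of cjklib. It decomposes each letter into a base letter plus combining marks (NFD) and drops the marks.

```python
import unicodedata


def strip_tone_marks(pinyin):
    """Return the pinyin with all combining accents removed."""
    # NFD splits e.g. 'ǎ' into 'a' followed by a combining caron
    decomposed = unicodedata.normalize("NFD", pinyin)
    # keep everything that is not a combining mark
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_tone_marks("hǎo"))
```

Note that this also turns ü into u, so syllables such as lü and lu become indistinguishable. If that matters, a numeric tone representation is the safer choice.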