Are there any scripts, libraries, or programs written in Python, or command-line tools usable from Bash (e.g. awk, perl, sed), that can correctly convert numbered pinyin (e.g. dian4 nao3) to UTF-8 pinyin with tone marks (e.g. diàn nǎo)?

I have found the following examples, but they require PHP or C#:

I have also found various online tools, but they cannot handle a large number of conversions.

smci
Village

7 Answers

I've got some Python 3 code that does this, and it's small enough to just put directly in the answer here.

import re

PinyinToneMark = {
    0: "aoeiuv\u00fc",
    1: "\u0101\u014d\u0113\u012b\u016b\u01d6\u01d6",
    2: "\u00e1\u00f3\u00e9\u00ed\u00fa\u01d8\u01d8",
    3: "\u01ce\u01d2\u011b\u01d0\u01d4\u01da\u01da",
    4: "\u00e0\u00f2\u00e8\u00ec\u00f9\u01dc\u01dc",
}

def decode_pinyin(s):
    s = s.lower()
    r = ""
    t = ""
    for c in s:
        if c >= 'a' and c <= 'z':
            t += c
        elif c == ':':
            assert t[-1] == 'u'
            t = t[:-1] + "\u00fc"
        else:
            if c >= '0' and c <= '5':
                tone = int(c) % 5
                if tone != 0:
                    m = re.search("[aoeiuv\u00fc]+", t)
                    if m is None:
                        t += c
                    elif len(m.group(0)) == 1:
                        t = t[:m.start(0)] + PinyinToneMark[tone][PinyinToneMark[0].index(m.group(0))] + t[m.end(0):]
                    else:
                        if 'a' in t:
                            t = t.replace("a", PinyinToneMark[tone][0])
                        elif 'o' in t:
                            t = t.replace("o", PinyinToneMark[tone][1])
                        elif 'e' in t:
                            t = t.replace("e", PinyinToneMark[tone][2])
                        elif t.endswith("ui"):
                            t = t.replace("i", PinyinToneMark[tone][3])
                        elif t.endswith("iu"):
                            t = t.replace("u", PinyinToneMark[tone][4])
                        else:
                            t += "!"
            r += t
            t = ""
    r += t
    return r

This handles ü, u:, and v, all of which I've encountered. Minor modifications will be needed for Python 2 compatibility.
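
As a small illustration of the ü handling mentioned above, the sketch below rewrites the two ASCII spellings as ü before any tone marks are applied. The helper name `normalize_u_umlaut` is hypothetical and not part of the answer's code, and it assumes the input is pinyin only, since a plain "v" in English text would also be rewritten.

```python
def normalize_u_umlaut(text: str) -> str:
    """Rewrite the ASCII spellings "u:" and "v" as "ü", preserving case."""
    # Assumption: `text` contains pinyin only; English words with "v" would be altered.
    for ascii_form, umlaut in (("u:", "ü"), ("U:", "Ü"), ("v", "ü"), ("V", "Ü")):
        text = text.replace(ascii_form, umlaut)
    return text


print(normalize_u_umlaut("nu:3 lv4"))  # nü3 lü4
```

Running this as a separate first pass keeps the tone-placement logic free of special cases for the alternative spellings.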

Greg Hewgill
  • `decode_pinyin` is a function. Call it like `decode_pinyin("ni3 hao3")` or read the input from a file or whatever you like. – Greg Hewgill Nov 20 '11 at 09:25
  • If you are allowing multiple syllables, as your "ni3 hao3" example indicates, it would be a good idea to preserve punctuation (at least spaces and apostrophes!) in the output. – John Machin Nov 20 '11 at 21:29
  • That's true. Modifications to support that (which wasn't required by my application) should be straightforward. – Greg Hewgill Nov 21 '11 at 17:23
  • Seen elsewhere (pinyin.info)... changes should: prefer 'a' or 'e' (no pinyin uses both), then prefer 'o', otherwise use the last vowel. Since no pinyin uses both 'o' and 'e', I guess the order of checking in the code is not a problem. – Shenme Jan 04 '12 at 21:36
  • Thanks for this! Just FYI, the only change necessary for Python 2.x is to add a `u` prefix (for unicode) in front of any string containing `\u....` escapes; that fixed it for me. – Herman Schaaf May 22 '12 at 08:10
  • @GregHewgill Here you can find a Lua implementation of your script: http://tex.stackexchange.com/a/125128/16071 – susis strolch Jul 23 '13 at 11:29
  • do you also have a flavour for the reverse, ie accented to numbered pinyin? – ccpizza Jul 25 '20 at 19:14
  • don't forget to import re – MintWelsh Nov 14 '21 at 23:56

I wrote another Python function that does this, which is case insensitive and preserves spaces, punctuation and other text (unless there are false positives, of course):

# -*- coding: utf-8 -*-
import re

pinyinToneMarks = {
    u'a': u'āáǎà', u'e': u'ēéěè', u'i': u'īíǐì',
    u'o': u'ōóǒò', u'u': u'ūúǔù', u'ü': u'ǖǘǚǜ',
    u'A': u'ĀÁǍÀ', u'E': u'ĒÉĚÈ', u'I': u'ĪÍǏÌ',
    u'O': u'ŌÓǑÒ', u'U': u'ŪÚǓÙ', u'Ü': u'ǕǗǙǛ'
}

def convertPinyinCallback(m):
    tone=int(m.group(3))%5
    r=m.group(1).replace(u'v', u'ü').replace(u'V', u'Ü')
    # for multiple vowels, use the first one if it is a/e/o, otherwise use the second one
    pos=0
    if len(r)>1 and not r[0] in 'aeoAEO':
        pos=1
    if tone != 0:
        r=r[0:pos]+pinyinToneMarks[r[pos]][tone-1]+r[pos+1:]
    return r+m.group(2)

def convertPinyin(s):
    return re.sub(r'([aeiouüvÜ]{1,3})(n?g?r?)([012345])', convertPinyinCallback, s, flags=re.IGNORECASE)

print(convertPinyin('Ni3 hao3 ma0?'))
dani_l

The cjklib library does cover your needs:

Either use the Python shell:

>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> print f.convert('Bei3jing1', 'Pinyin', 'Pinyin', sourceOptions={'toneMarkType': 'numbers'})
Běijīng

Or just the command line:

$ cjknife -m Bei3jing1
Běijīng

Disclaimer: I developed that library.

Herman Schaaf
cburgmer
  • Can cjknife or other function easily convert to diacritics within a file that also contains Hanzi and English ? Or within an Anki DB (which is SQLLite, I believe)? – WGroleau Dec 27 '13 at 23:39
  • `$ cjknife.py -m "ni3hao3吗?Hello"` will return "nǐhǎo吗?Hello". However if you feed in "How are you?" It will complain about missing tones for _you_ which is a valid Pinyin syllable. So I guess you would need to separate English from Pinyin first. – cburgmer Jan 20 '14 at 21:56
  • When it complains about missing tones, does it still output the original characters? Many people use no digit for the neutral tone. Others use 5 or 0. Does it handle that? And then there is the unfortunate practice of using an ambiguous u: for ü. – WGroleau Mar 22 '14 at 14:07
  • sourceforge points me to cjklib.org for documentation, but I got an error 500 there. – WGroleau Jan 31 '18 at 16:55

Updated code: be aware that @Lakedaemon's Kotlin code does not take the tone placement rules into account:

  • A and e trump all other vowels and always take the tone mark. There are no Mandarin syllables in Hanyu Pinyin that contain both a and e.
  • In the combination ou, o takes the mark.
  • In all other cases, the final vowel takes the mark.
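
For readers following the Python answers, the three rules above can be sketched in Python as follows. The function name `tone_mark_index` is hypothetical, and the sketch assumes its argument is the vowel cluster of a single valid pinyin syllable (for example "iao" or "ou").

```python
def tone_mark_index(vowels: str) -> int:
    """Return the index of the vowel in `vowels` that takes the tone mark."""
    lower = vowels.lower()
    # Rule 1: a and e always take the mark (they never occur together).
    for preferred in "ae":
        idx = lower.find(preferred)
        if idx >= 0:
            return idx
    # Rule 2: in the combination "ou", the o takes the mark.
    if lower == "ou":
        return 0
    # Rule 3: in all other cases, the final vowel takes the mark.
    return len(lower) - 1


for cluster in ("iao", "ou", "ui", "iu", "uo"):
    print(cluster, tone_mark_index(cluster))
```

For the five clusters in the loop this returns 1, 0, 1, 1 and 1, i.e. the mark lands on a, o, i, u and o respectively.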

I originally ported @Lakedaemon's Kotlin code to Java. I have since modified it, and I urge anyone who used my earlier port or @Lakedaemon's Kotlin code to update.

I added an auxiliary function to get the correct tone mark position.


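    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // The methods below are meant to be placed inside a class of your choice.
    private static int getTonePosition(String r) {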
        String lowerCase = r.toLowerCase();

        // exception to the rule
        if (lowerCase.equals("ou")) return 0;

        // higher precedence, both never go together
        int preferencePosition = lowerCase.indexOf('a');
        if (preferencePosition >= 0) return preferencePosition;
        preferencePosition = lowerCase.indexOf('e');
        if (preferencePosition >= 0) return preferencePosition;

        // otherwise the last one takes the tone mark
        return lowerCase.length() - 1;
    }

    static public String getCharacter(String string, int position) {
        char[] characters = string.toCharArray();
        return String.valueOf(characters[position]);
    }

    static public String toPinyin(String asciiPinyin) {
        Map<String, String> pinyinToneMarks = new HashMap<>();
        pinyinToneMarks.put("a", "āáǎà"); pinyinToneMarks.put("e", "ēéěè");
        pinyinToneMarks.put("i", "īíǐì"); pinyinToneMarks.put("o",  "ōóǒò");
        pinyinToneMarks.put("u", "ūúǔù"); pinyinToneMarks.put("ü", "ǖǘǚǜ");
        pinyinToneMarks.put("A",  "ĀÁǍÀ"); pinyinToneMarks.put("E", "ĒÉĚÈ");
        pinyinToneMarks.put("I", "ĪÍǏÌ"); pinyinToneMarks.put("O", "ŌÓǑÒ");
        pinyinToneMarks.put("U", "ŪÚǓÙ"); pinyinToneMarks.put("Ü",  "ǕǗǙǛ");

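        // CASE_INSENSITIVE lets uppercase ASCII vowels and finals match, so the uppercase map entries are reachable
        Pattern pattern = Pattern.compile("([aeiouüvÜ]{1,3})(n?g?r?)([012345])", Pattern.CASE_INSENSITIVE);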
        Matcher matcher = pattern.matcher(asciiPinyin);
        StringBuilder s = new StringBuilder();
        int start = 0;

        while (matcher.find(start)) {
            s.append(asciiPinyin, start, matcher.start(1));
            int tone = Integer.parseInt(matcher.group(3)) % 5;
            String r = matcher.group(1).replace("v", "ü").replace("V", "Ü");
            if (tone != 0) {
                int pos = getTonePosition(r);
                s.append(r, 0, pos).append(getCharacter(pinyinToneMarks.get(getCharacter(r, pos)),tone - 1)).append(r, pos + 1, r.length());
            } else {
                s.append(r);
            }
            s.append(matcher.group(2));
            start = matcher.end(3);
        }
        if (start != asciiPinyin.length()) {
            s.append(asciiPinyin, start, asciiPinyin.length());
        }
        return s.toString();
    }

With the Python package dragonmapper (`pip install dragonmapper`):

Hanzi to pinyin

from dragonmapper import hanzi

hanzi.to_pinyin("过河拆桥。")
# >>> 'guòhéchāiqiáo。'

hanzi.to_pinyin("过河拆桥。", accented=False)
# >>> 'guo4he2chai1qiao2。'

Accented pinyin to numbered pinyin

from dragonmapper.transcriptions import accented_to_numbered

accented_to_numbered('guò hé chāi qiáo。')
# >>> 'guo4 he2 chai1 qiao2。'

Numbered pinyin to accented pinyin

from dragonmapper.transcriptions import numbered_to_accented

numbered_to_accented('guo4 he2 chai1 qiao2。')
# >>> 'guò hé chāi qiáo。'

DISCLAIMER: I have no connection with the dragonmapper author
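
If installing a package is not an option, the accented-to-numbered direction can also be approximated with the standard library alone. The sketch below is not how dragonmapper works internally; it is a minimal illustration that assumes syllables are already separated by spaces or punctuation, and the names `syllable_to_numbered` and `text_to_numbered` are hypothetical. It relies on Unicode NFD decomposition, which splits each accented vowel into a base letter plus a combining tone mark.

```python
import re
import unicodedata

# Combining marks that carry the four tones after NFD decomposition.
TONE_MARKS = {"\u0304": "1", "\u0301": "2", "\u030c": "3", "\u0300": "4"}


def syllable_to_numbered(syllable: str) -> str:
    """Convert one accented syllable, e.g. "qiáo", to "qiao2"."""
    tone = ""
    kept = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]  # remember the tone, drop the mark
        else:
            kept.append(ch)  # the diaeresis U+0308 is kept, so ü survives
    return unicodedata.normalize("NFC", "".join(kept)) + tone


def text_to_numbered(text: str) -> str:
    """Apply syllable_to_numbered to every run of letters in `text`."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"[^\W\d_]+", lambda m: syllable_to_numbered(m.group(0)), text)


print(text_to_numbered("guò hé chāi qiáo。"))  # guo4 he2 chai1 qiao2。
```

Unmarked (neutral tone) syllables are left without a digit, and any other accented Latin text, such as French words, would be treated as pinyin, so this is only suitable for input known to be pinyin.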

smci
ccpizza

I ported dani_l's code to Kotlin (a Java version should be quite similar). Here it is:

import java.util.regex.Pattern
val pinyinToneMarks = mapOf(
    'a' to "āáǎà",
    'e' to "ēéěè",
    'i' to "īíǐì",
    'o' to  "ōóǒò",
    'u' to "ūúǔù",
    'ü' to "ǖǘǚǜ",
    'A' to  "ĀÁǍÀ",
    'E' to "ĒÉĚÈ",
    'I' to "ĪÍǏÌ",
    'O' to "ŌÓǑÒ",
    'U' to "ŪÚǓÙ",
    'Ü' to  "ǕǗǙǛ"
)

fun toPinyin(asciiPinyin: String) :String {
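  // CASE_INSENSITIVE lets uppercase ASCII vowels and finals match, so the uppercase map entries are reachable
  val pattern = Pattern.compile("([aeiouüvÜ]{1,3})(n?g?r?)([012345])", Pattern.CASE_INSENSITIVE)!!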
  val matcher = pattern.matcher(asciiPinyin)
  val s = StringBuilder()
  var start = 0
  while (matcher.find(start)) {
      s.append(asciiPinyin, start, matcher.start(1))
      val tone = Integer.parseInt(matcher.group(3)!!) % 5
      val r = matcher.group(1)!!.replace("v", "ü").replace("V", "Ü")
      // for multiple vowels, use the first one if it is a/e/o, otherwise use the second one
      val pos = if (r.length >1 && r[0].toString() !in "aeoAEO") 1 else 0
      if (tone != 0) s.append(r, 0, pos).append(pinyinToneMarks[r[pos]]!![tone - 1]).append(r, pos + 1, r.length)
      else s.append(r)
      s.append(matcher.group(2))
      start = matcher.end(3)
  }
  if (start != asciiPinyin.length) s.append(asciiPinyin, start, asciiPinyin.length)
  return s.toString()
}

fun test() = print(toPinyin("Ni3 hao3 ma0?"))
Lakedaemon

I came across a VBA macro at pinyinjoe.com that does this in Microsoft Word.

It had a minor flaw, which I reported; the author responded that he would incorporate my suggestion "as soon as I can". That was in early January 2014. I haven't had any motivation to check since, because the fix is already in my copy.

WGroleau