Convert numbered pinyin to pinyin with tone marks

Question

Are there any scripts, libraries, or programs using Python, or BASH tools (e.g. awk, perl, sed) which can correctly convert numbered pinyin (e.g. dian4 nao3) to UTF-8 pinyin with tone marks (e.g. diàn nǎo)?

I have found the following examples, but they require PHP or C#:

I have also found various online tools, but they cannot handle a large number of conversions.

Greg Hewgill · Accepted Answer · 2011-11-20T08:51:43.087

24

I've got some Python 3 code that does this, and it's small enough to just put directly in the answer here.

import re

PinyinToneMark = {
    0: "aoeiuv\u00fc",
    1: "\u0101\u014d\u0113\u012b\u016b\u01d6\u01d6",
    2: "\u00e1\u00f3\u00e9\u00ed\u00fa\u01d8\u01d8",
    3: "\u01ce\u01d2\u011b\u01d0\u01d4\u01da\u01da",
    4: "\u00e0\u00f2\u00e8\u00ec\u00f9\u01dc\u01dc",
}

def decode_pinyin(s):
    s = s.lower()
    r = ""
    t = ""
    for c in s:
        if c >= 'a' and c <= 'z':
            t += c
        elif c == ':':
            assert t[-1] == 'u'
            t = t[:-1] + "\u00fc"
        else:
            if c >= '0' and c <= '5':
                tone = int(c) % 5
                if tone != 0:
                    m = re.search("[aoeiuv\u00fc]+", t)
                    if m is None:
                        t += c
                    elif len(m.group(0)) == 1:
                        t = t[:m.start(0)] + PinyinToneMark[tone][PinyinToneMark[0].index(m.group(0))] + t[m.end(0):]
                    else:
                        if 'a' in t:
                            t = t.replace("a", PinyinToneMark[tone][0])
                        elif 'o' in t:
                            t = t.replace("o", PinyinToneMark[tone][1])
                        elif 'e' in t:
                            t = t.replace("e", PinyinToneMark[tone][2])
                        elif t.endswith("ui"):
                            t = t.replace("i", PinyinToneMark[tone][3])
                        elif t.endswith("iu"):
                            t = t.replace("u", PinyinToneMark[tone][4])
                        else:
                            t += "!"
            r += t
            t = ""
    r += t
    return r

This handles ü, u:, and v, all of which I've encountered. Minor modifications will be needed for Python 2 compatibility.
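A minimal sketch of how the three spellings mentioned above (ü, u:, and v) could be normalized to ü before any tone conversion. The function name `normalize_u_umlaut` is hypothetical and not part of the answer's code. The sketch also assumes the input contains only pinyin, since any English word containing a "v" would be altered too.

```python
def normalize_u_umlaut(text: str) -> str:
    """Rewrite the ASCII stand-ins 'u:' and 'v' as 'ü', preserving case.

    Hypothetical helper for illustration; assumes pinyin-only input.
    """
    # Handle 'u:' / 'U:' first so the colon is consumed with its vowel.
    text = text.replace("u:", "ü").replace("U:", "Ü")
    # Standard pinyin has no letter 'v', so any 'v' is taken to mean 'ü'.
    return text.replace("v", "ü").replace("V", "Ü")


print(normalize_u_umlaut("nu:3 lv4 NV3"))  # nü3 lü4 NÜ3
```

Running this as a separate pass keeps the tone-marking step free of special cases, at the cost of one extra scan over the input.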

edited Nov 20 '11 at 08:51

answered Nov 20 '11 at 08:39

Greg Hewgill


`decode_pinyin` is a function. Call it like `decode_pinyin("ni3 hao3")` or read the input from a file or whatever you like. – Greg Hewgill Nov 20 '11 at 09:25
If you are allowing multiple syllables, as your "ni3 hao3" example indicates, it would be a good idea to preserve punctuation (at least spaces and apostrophes!) in the output. – John Machin Nov 20 '11 at 21:29
That's true. Modifications to support that (which wasn't required by my application) should be straightforward. – Greg Hewgill Nov 21 '11 at 17:23
Seen elsewhere (pinyin.info)... changes should: prefer 'a' or 'e' (no pinyin uses both), then prefer 'o', otherwise use the last vowel. Since no pinyin uses both 'o' and 'e', I guess the order of checking in the code is not a problem. – Shenme Jan 04 '12 at 21:36
Thanks for this! Just FYI, the only change necessary for Python 2.x is to add a `u` (for unicode) prefix to any string literal containing `\u....` escapes; that fixed it for me. – Herman Schaaf May 22 '12 at 08:10
@GregHewgill Here you can find a Lua implementation of your script: http://tex.stackexchange.com/a/125128/16071 – susis strolch Jul 23 '13 at 11:29
do you also have a flavour for the reverse, ie accented to numbered pinyin? – ccpizza Jul 25 '20 at 19:14
don't forget to import re – MintWelsh Nov 14 '21 at 23:56

score 7 · Answer 2 · answered Jan 31 '14 at 19:37

I wrote another Python function that does this, which is case insensitive and preserves spaces, punctuation and other text (unless there are false positives, of course):

# -*- coding: utf-8 -*-
import re

pinyinToneMarks = {
    u'a': u'āáǎà', u'e': u'ēéěè', u'i': u'īíǐì',
    u'o': u'ōóǒò', u'u': u'ūúǔù', u'ü': u'ǖǘǚǜ',
    u'A': u'ĀÁǍÀ', u'E': u'ĒÉĚÈ', u'I': u'ĪÍǏÌ',
    u'O': u'ŌÓǑÒ', u'U': u'ŪÚǓÙ', u'Ü': u'ǕǗǙǛ'
}

def convertPinyinCallback(m):
    tone=int(m.group(3))%5
    r=m.group(1).replace(u'v', u'ü').replace(u'V', u'Ü')
    # for multiple vowels, use the first one if it is a/e/o, otherwise use the second one
    pos=0
    if len(r)>1 and not r[0] in 'aeoAEO':
        pos=1
    if tone != 0:
        r=r[0:pos]+pinyinToneMarks[r[pos]][tone-1]+r[pos+1:]
    return r+m.group(2)

def convertPinyin(s):
    return re.sub(ur'([aeiouüvÜ]{1,3})(n?g?r?)([012345])', convertPinyinCallback, s, flags=re.IGNORECASE)

print convertPinyin(u'Ni3 hao3 ma0?')
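
The code above is Python 2 only: `ur''` literals and the `print` statement are syntax errors in Python 3. What follows is a sketch of the same approach for Python 3. The names `TONE_MARKS`, `SYLLABLE_RE`, `_mark`, and `convert_pinyin` are chosen here for illustration, and the vowel-selection heuristic is kept unchanged from the answer.

```python
import re

# Tone-marked forms for tones 1-4, indexed by the unmarked vowel.
TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
    "A": "ĀÁǍÀ", "E": "ĒÉĚÈ", "I": "ĪÍǏÌ",
    "O": "ŌÓǑÒ", "U": "ŪÚǓÙ", "Ü": "ǕǗǙǛ",
}

# Vowel cluster, optional n/ng/r ending, then the tone digit.
SYLLABLE_RE = re.compile(r"([aeiouüv]{1,3})(n?g?r?)([012345])", re.IGNORECASE)


def _mark(m):
    tone = int(m.group(3)) % 5  # 0 and 5 both mean the neutral tone
    vowels = m.group(1).replace("v", "ü").replace("V", "Ü")
    if tone == 0:
        return vowels + m.group(2)
    # Same heuristic as the answer: first vowel if it is a/e/o, else the second.
    pos = 1 if len(vowels) > 1 and vowels[0] not in "aeoAEO" else 0
    marked = TONE_MARKS[vowels[pos]][tone - 1]
    return vowels[:pos] + marked + vowels[pos + 1:] + m.group(2)


def convert_pinyin(text):
    # Text outside a match (spaces, punctuation, consonants) is left untouched.
    return SYLLABLE_RE.sub(_mark, text)


print(convert_pinyin("Ni3 hao3 ma0?"))  # Nǐ hǎo ma?
```

In Python 3, `re.IGNORECASE` on a `str` pattern applies Unicode case rules, so the explicit `Ü` in the original character class is no longer needed.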

score 6 · Answer 3 · edited May 22 '12 at 07:42

The cjklib library does cover your needs:

Either use the Python shell:

>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> print f.convert('Bei3jing1', 'Pinyin', 'Pinyin', sourceOptions={'toneMarkType': 'numbers'})
Běijīng

Or just the command line:

$ cjknife -m Bei3jing1
Běijīng

Disclaimer: I developed that library.

edited May 22 '12 at 07:42

Herman Schaaf


answered May 20 '12 at 21:03

cburgmer


Can cjknife or another function easily convert to diacritics within a file that also contains Hanzi and English? Or within an Anki DB (which is SQLite, I believe)? – WGroleau Dec 27 '13 at 23:39
`$ cjknife.py -m "ni3hao3吗？Hello"` will return "nǐhǎo吗？Hello". However if you feed in "How are you?" It will complain about missing tones for _you_ which is a valid Pinyin syllable. So I guess you would need to separate English from Pinyin first. – cburgmer Jan 20 '14 at 21:56
When it complains about missing tones, does it still output the original characters? Many people use no digit for the neutral tone. Others use 5 or 0. Does it handle that? And then there is the unfortunate practice of using an ambiguous u: for ü. – WGroleau Mar 22 '14 at 14:07
sourceforge points me to cjklib.org for documentation, but I got an error 500 there. – WGroleau Jan 31 '18 at 16:55

Ezequiel Santiago Sánchez · Answer 4 · 2019-12-02T03:41:10.180

Updated code. Note that @Lakedaemon's Kotlin code doesn't take the tone placement rules into account:

A and e trump all other vowels and always take the tone mark. There are no Mandarin syllables in Hanyu Pinyin that contain both a and e.
In the combination ou, o takes the mark.
In all other cases, the final vowel takes the mark.
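
The three rules above can be sketched in Python as follows. The function name `tone_mark_index` is hypothetical, and the sketch assumes its argument is only the vowel cluster of a single syllable (for example "iao" from "xiao"), not the whole syllable.

```python
def tone_mark_index(vowels: str) -> int:
    """Return the index of the vowel that takes the tone mark.

    Assumes `vowels` is the vowel cluster of one pinyin syllable.
    """
    lowered = vowels.lower()
    # Rule 1: 'a' or 'e' always takes the mark (they never occur together).
    for preferred in "ae":
        idx = lowered.find(preferred)
        if idx >= 0:
            return idx
    # Rule 2: in the combination 'ou', the 'o' takes the mark.
    idx = lowered.find("ou")
    if idx >= 0:
        return idx
    # Rule 3: in all other cases the final vowel takes the mark.
    return len(lowered) - 1


for cluster in ("ao", "ei", "ou", "iu", "ui", "uo", "üe"):
    print(cluster, "->", cluster[tone_mark_index(cluster)])
```

The loop prints the vowel chosen for each cluster, which makes it easy to compare against a reference table of pinyin finals.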

I originally ported @Lakedaemon's Kotlin code to Java. I have now modified it, and I urge anyone who used my earlier port or @Lakedaemon's Kotlin code to update.

I added an extra auxiliary function to get the correct tone mark position.


    // Needs java.util.HashMap, java.util.Map, java.util.regex.Matcher and java.util.regex.Pattern.
    private static int getTonePosition(String r) {
        String lowerCase = r.toLowerCase();

        // exception to the rule
        if (lowerCase.equals("ou")) return 0;

        // higher precedence, both never go together
        int preferencePosition = lowerCase.indexOf('a');
        if (preferencePosition >= 0) return preferencePosition;
        preferencePosition = lowerCase.indexOf('e');
        if (preferencePosition >= 0) return preferencePosition;

        // otherwise the last one takes the tone mark
        return lowerCase.length() - 1;
    }

    static public String getCharacter(String string, int position) {
        char[] characters = string.toCharArray();
        return String.valueOf(characters[position]);
    }

    static public String toPinyin(String asciiPinyin) {
        Map<String, String> pinyinToneMarks = new HashMap<>();
        pinyinToneMarks.put("a", "āáǎà"); pinyinToneMarks.put("e", "ēéěè");
        pinyinToneMarks.put("i", "īíǐì"); pinyinToneMarks.put("o",  "ōóǒò");
        pinyinToneMarks.put("u", "ūúǔù"); pinyinToneMarks.put("ü", "ǖǘǚǜ");
        pinyinToneMarks.put("A",  "ĀÁǍÀ"); pinyinToneMarks.put("E", "ĒÉĚÈ");
        pinyinToneMarks.put("I", "ĪÍǏÌ"); pinyinToneMarks.put("O", "ŌÓǑÒ");
        pinyinToneMarks.put("U", "ŪÚǓÙ"); pinyinToneMarks.put("Ü",  "ǕǗǙǛ");

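        // Case-insensitive, so that the uppercase entries in the map above are reachable.
        Pattern pattern = Pattern.compile("([aeiouüvÜ]{1,3})(n?g?r?)([012345])",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);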
        Matcher matcher = pattern.matcher(asciiPinyin);
        StringBuilder s = new StringBuilder();
        int start = 0;

        while (matcher.find(start)) {
            s.append(asciiPinyin, start, matcher.start(1));
            int tone = Integer.parseInt(matcher.group(3)) % 5;
            String r = matcher.group(1).replace("v", "ü").replace("V", "Ü");
            if (tone != 0) {
                int pos = getTonePosition(r);
                s.append(r, 0, pos).append(getCharacter(pinyinToneMarks.get(getCharacter(r, pos)),tone - 1)).append(r, pos + 1, r.length());
            } else {
                s.append(r);
            }
            s.append(matcher.group(2));
            start = matcher.end(3);
        }
        if (start != asciiPinyin.length()) {
            s.append(asciiPinyin, start, asciiPinyin.length());
        }
        return s.toString();
    }

Is there a specific reason why you don't use `string.charAt(position)` in the `getCharacter()` method? – Thomas, Feb 17 '21 at 20:18
I don't know why I didn't use that function. I guess I missed it, so thanks! – Ezequiel Santiago Sánchez, May 15 '22 at 01:33

score 3 · Answer 5 · edited Jan 06 '22 at 10:15

With the Python package dragonmapper (pip install dragonmapper):

Hanzi to pinyin

from dragonmapper import hanzi

hanzi.to_pinyin("过河拆桥。")
# >>> 'guòhéchāiqiáo。'

hanzi.to_pinyin("过河拆桥。", accented=False)
# >>> 'guo4he2chai1qiao2。'

Accented pinyin to numbered pinyin

from dragonmapper.transcriptions import accented_to_numbered

accented_to_numbered('guò hé chāi qiáo。')
# >>> 'guo4 he2 chai1 qiao2。'

Numbered pinyin to accented pinyin

from dragonmapper.transcriptions import numbered_to_accented

numbered_to_accented('guo4 he2 chai1 qiao2。')
# >>> 'guò hé chāi qiáo。'

Disclaimer: I have no connection with the dragonmapper author.

score 1 · Answer 6 · answered Oct 10 '14 at 07:18

I ported the code from dani_l to Kotlin (the code in Java should be quite similar). Here it is:

import java.util.regex.Pattern
val pinyinToneMarks = mapOf(
    'a' to "āáǎà",
    'e' to "ēéěè",
    'i' to "īíǐì",
    'o' to  "ōóǒò",
    'u' to "ūúǔù",
    'ü' to "ǖǘǚǜ",
    'A' to  "ĀÁǍÀ",
    'E' to "ĒÉĚÈ",
    'I' to "ĪÍǏÌ",
    'O' to "ŌÓǑÒ",
    'U' to "ŪÚǓÙ",
    'Ü' to  "ǕǗǙǛ"
)

fun toPinyin(asciiPinyin: String) :String {
  val pattern = Pattern.compile("([aeiouüvÜ]{1,3})(n?g?r?)([012345])")!!
  val matcher = pattern.matcher(asciiPinyin)
  val s = StringBuilder()
  var start = 0
  while (matcher.find(start)) {
      s.append(asciiPinyin, start, matcher.start(1))
      val tone = Integer.parseInt(matcher.group(3)!!) % 5
      val r = matcher.group(1)!!.replace("v", "ü").replace("V", "Ü")
      // for multiple vowels, use the first one if it is a/e/o, otherwise use the second one
      val pos = if (r.length >1 && r[0].toString() !in "aeoAEO") 1 else 0
      if (tone != 0) s.append(r, 0, pos).append(pinyinToneMarks[r[pos]]!![tone - 1]).append(r, pos + 1, r.length)
      else s.append(r)
      s.append(matcher.group(2))
      start = matcher.end(3)
  }
  if (start != asciiPinyin.length) s.append(asciiPinyin, start, asciiPinyin.length)
  return s.toString()
}

fun test() = print(toPinyin("Ni3 hao3 ma0?"))

WGroleau · Answer 7 · 2014-03-22T14:00:01.817

-1

I came across a VBA macro at pinyinjoe.com that does this in Microsoft Word.

It had a minor flaw, which I reported, and the author responded that he would incorporate my suggestion "as soon as I can." That was early in January 2014; I haven't had any motivation to check, since the fix is already in my copy.

edited Mar 22 '14 at 14:00

answered Jan 22 '14 at 18:07

WGroleau

