6

Given a source text like

nin2 hao3 ma

(which is a typical way to write ASCII Pinyin, without proper accentuated characters) and given a (UTF8) conversion table like

a1;ā
e1;ē
i1;ī
o1;ō
u1;ū
ü1;ǖ
A1;Ā
E1;Ē
...

how would I convert the source text into

nín hǎo ma

?

For what it's worth I'm using PHP, and this might be a regex I'm looking into?

Makoto
  • 104,088
  • 27
  • 192
  • 230
Philipp Lenssen
  • 8,818
  • 13
  • 56
  • 77
  • Additional info for those looking into this: (from the wikipedia article http://en.wikipedia.org/wiki/Pinyin) An algorithm to find the correct vowel letter (when there is more than one) is as follows: 1. If there is an "a" or an "e", it will take the tone mark. 2. If there is an "ou", then the "o" takes the tone mark. 3. Otherwise, the second vowel takes the tone mark. – Kristoffer Sall-Storgaard Oct 21 '09 at 15:21

5 Answers5

11

Ollie's algorithm was a nice start, but it didn't apply the marks correctly. For example, qiao1 became qīāō. This one is correct and complete. You can easily see how the replacement rules are defined.

It does the whole thing for tone 5 as well, although it doesn't affect the output, except for deleting the number. I left it in, in case you want to do something with tone 5.

The algorithm works as follows:

  • The word and tone are provided in $match[1] and [2]
  • A star is added behind the letter that should get the accent mark
  • A letter with a star is replaced by that letter with the correct tone mark.

Example:

qiao => (iao becomes ia*o) => qia*o => qiǎo

This strategy, and the use of strtr (which prioritizes longer replacements), makes sure that this won't happen:

qiao1 => qīāō


function pinyin_addaccents($string) {
    # Find words with a number behind them, and replace with callback fn.
    return preg_replace_callback(
        '~([a-zA-ZüÜ]+)(\d)~',
        'pinyin_addaccents_cb',
        $string);
}

# Helper callback
function pinyin_addaccents_cb($match) {
    static $accentmap = null;

    if( $accentmap === null ) {
        # Where to place the accent marks
        $stars =
            'a* e* i* o* u* ü* '.
            'A* E* I* O* U* Ü* '.
            'a*i a*o e*i ia* ia*o ie* io* iu* '.
            'A*I A*O E*I IA* IA*O IE* IO* IU* '.
            'o*u ua* ua*i ue* ui* uo* üe* '.
            'O*U UA* UA*I UE* UI* UO* ÜE*';
        $nostars = str_replace('*', '', $stars);

        # Build an array like Array('a' => 'a*') and store statically
        $accentmap = array_combine(explode(' ',$nostars), explode(' ', $stars));
        unset($stars, $nostars);
    }

    static $vowels =
        Array('a*','e*','i*','o*','u*','ü*','A*','E*','I*','O*','U*','Ü*');

    static $pinyin = Array(
        1 => Array('ā','ē','ī','ō','ū','ǖ','Ā','Ē','Ī','Ō','Ū','Ǖ'),
        2 => Array('á','é','í','ó','ú','ǘ','Á','É','Í','Ó','Ú','Ǘ'),
        3 => Array('ǎ','ě','ǐ','ǒ','ǔ','ǚ','Ǎ','Ě','Ǐ','Ǒ','Ǔ','Ǚ'),
        4 => Array('à','è','ì','ò','ù','ǜ','À','È','Ì','Ò','Ù','Ǜ'),
        5 => Array('a','e','i','o','u','ü','A','E','I','O','U','Ü')
    );

    list(,$word,$tone) = $match;
    # Add star to vowelcluster
    $word = strtr($word, $accentmap);
    # Replace starred letter with accented 
    $word = str_replace($vowels, $pinyin[$tone], $word);
    return $word;
}
Bouke Versteegh
  • 4,097
  • 1
  • 39
  • 35
  • that works great! thanks so much! I have the following issue however: The 'Ǖ' are written like this in my data: 'lu:4' How can I integrate that here? – uncovery Sep 21 '11 at 07:36
  • I helped myself now by addind `$string = str_replace(array('u:', 'U:'), array('ü','Ü'),$string);` before `return preg_replace_callback( ` – uncovery Sep 21 '11 at 09:50
  • 2
    Just ported this idea to javascript and open-sourced a plugin here if anyone is interested: https://github.com/quizlet/pinyin-converter – David Chanin Jul 19 '13 at 23:05
1
<?php
$in = 'nin2 hao3 ma';
$out = 'nín hǎo ma';

function replacer($match) {
  static $trTable = array(
    1 => array(
      'a' => 'ā',
      'e' => 'ē',
      'i' => 'ī',
      'o' => 'ō',
      'u' => 'ū',
      'ü' => 'ǖ',
      'A' => 'Ā',
      'E' => 'Ē'),
    2 => array('i' => 'í'),
    3 => array('a' => 'ǎ')
  );
  list(, $word, $i) = $match;
  return str_replace(
    array_keys($trTable[$i]),
    array_values($trTable[$i]),
    $word); }

// Outputs: bool(true)
var_dump(preg_replace_callback('~(\w+)(\d+)~', 'replacer', $in) === $out);
Ollie Saunders
  • 7,787
  • 3
  • 29
  • 37
  • 1
    does this really take into account all the cases? what about 4th tone? what about `a` in 2nd tone? isn't there something special with `ou`? or were you just trying to provide a piece of code that needs to be expanded? – philfreo Nov 10 '10 at 02:18
  • So what about diphthongs? E.g. liao3 --> liǎo. How does this function figure those out? – imrek Jun 28 '15 at 12:11
1

To add a javascript solution:

This code places Tonemarks according to the official algorithm for placing one, see wikipedia.

Hope that helps some of you, suggestions and improvements wellcome!

    var ACCENTED = {
            '1': {'a': '\u0101', 'e': '\u0113', 'i': '\u012B', 'o': '\u014D', 'u': '\u016B', 'ü': '\u01D6'},
            '2': {'a': '\u00E1', 'e': '\u00E9', 'i': '\u00ED', 'o': '\u00F3', 'u': '\u00FA', 'ü': '\u01D8'},
            '3': {'a': '\u01CE', 'e': '\u011B', 'i': '\u01D0', 'o': '\u01D2', 'u': '\u01D4', 'ü': '\u01DA'},
            '4': {'a': '\u00E0', 'e': '\u00E8', 'i': '\u00EC', 'o': '\u00F2', 'u': '\u00F9', 'ü': '\u01DC'},
            '5': {'a': 'a', 'e': 'e', 'i': 'i', 'o': 'o', 'u': 'u', 'ü': 'ü'}
    };

     function getPos (token) {
            if (token.length === 1){
                // only one letter, nothing to differentiate
                return 0;
            }
            var precedence = ['a', 'e', 'o'];
            for (i=0; i<precedence.length; i += 1){
                var pos = token.indexOf(precedence[i]);
                // checking a before o, will take care of ao automatically
                if (pos >= 0){
                    return pos;
                }
            }
            var u = token.indexOf('u');
            var i = token.indexOf('i');
            if (i < u){
                // -iu OR u-only case, accent goes to u
                return u;
            } else {
                // -ui OR i-only case, accent goes to i
                return i;
            }
            // the only vowel left is ü
            var ü = token.indexOf('ü');
            if (ü >= 0){
                return ü;
            }
        }

     //and finally:
     // we asume the input to be valid PinYin, therefore no security checks....
     function placeTone(numbered_PinYin){
              var ToneIndex = numbered_PinYin.charAt(numbered_PinYin.length -1);
              var accentpos = getPos(numbered_PinYin);
              var accented_Char = ACCENTED[ToneIndex][numbered_PinYin.charAt(accentpos)];

              var accented_PinYin = "";
              if (accentpos === 0){
                  // minus one to trimm the number off
                  accented_PinYin = accented_Char + numbered_PinYin.substr(1, numbered_PinYin.length-1); 
              } else {
                  var before = numbered_PinYin.substr(0, accentpos);
                  var after = numbered_PinYin.substring(accentpos+1, numbered_PinYin.length-1);
                  accented_PinYin = before + accented_Char + after;
               }
               return accented_PinYin;
    }

    console.log(placeTone('han4 zi4'));
JasonL
  • 34
  • 3
tike
  • 2,234
  • 17
  • 19
1

For a .NET solution try Pinyin4j.NET

Features Convert Chinese (both Simplified and Traditional) to most popular pinyin systems. Supporting pinyin system are listed below.

  • Hanyu Pinyin 汉语拼音
  • Tongyong Pinyin 通用拼音
  • Wade-Giles 威妥玛拼音
  • MPS2 (Mandarin Phonetic Symbols II) 国语注音符号第二式
  • Yale Romanization 耶鲁罗马化拼音
  • Gwoyeu Romatzyh国语国语罗马化拼音
Dale
  • 12,884
  • 8
  • 53
  • 83
0

VB Macro (Libre)Office : Convert pinyin tone numbers to accents

Hopefully the algorithm is correct accordingly to pinyin rules specially for i and u.

    sub replaceNumberByTones

    call PinyinTonesNumber("a([a-z]*[a-z]*)0", "a$1")
    call PinyinTonesNumber("a([a-z]*[a-z]*)1", "a$1")
    call PinyinTonesNumber("a([a-z]*[a-z]*)2", "á$1")
    call PinyinTonesNumber("a([a-z]*[a-z]*)3", "a$1")
    call PinyinTonesNumber("a([a-z]*[a-z]*)4", "à$1")

    call PinyinTonesNumber("o([a-z]*[a-z]*)0", "o$1")
    call PinyinTonesNumber("o([a-z]*[a-z]*)1", "o$1")
    call PinyinTonesNumber("o([a-z]*[a-z]*)2", "ó$1")
    call PinyinTonesNumber("o([a-z]*[a-z]*)3", "o$1")
    call PinyinTonesNumber("o([a-z]*[a-z]*)4", "ò$1")

    call PinyinTonesNumber("e([a-z]*[a-z]*)0", "e$1")
    call PinyinTonesNumber("e([a-z]*[a-z]*)1", "e$1")
    call PinyinTonesNumber("e([a-z]*[a-z]*)2", "é$1")
    call PinyinTonesNumber("e([a-z]*[a-z]*)3", "e$1")
    call PinyinTonesNumber("e([a-z]*[a-z]*)4", "è$1")

    call PinyinTonesNumber("u([a-hj-z]*[a-hj-z]*)0", "u$1")
    call PinyinTonesNumber("u([a-hj-z]*[a-hj-z]*)1", "u$1")
    call PinyinTonesNumber("u([a-hj-z]*[a-hj-z]*)2", "ú$1")
    call PinyinTonesNumber("u([a-hj-z]*[a-hj-z]*)3", "u$1")
    call PinyinTonesNumber("u([a-hj-z]*[a-hj-z]*)4", "ù$1")

    call PinyinTonesNumber("i([a-z]*[a-z]*)0", "i$1")
    call PinyinTonesNumber("i([a-z]*[a-z]*)1", "i$1")
    call PinyinTonesNumber("i([a-z]*[a-z]*)2", "í$1")
    call PinyinTonesNumber("i([a-z]*[a-z]*)3", "i$1")
    call PinyinTonesNumber("i([a-z]*[a-z]*)4", "ì$1")

    End sub

    sub PinyinTonesNumber(expression, replacement)
    rem ----------------------------------------------------------------------
    rem define variables
    dim document   as object
    dim dispatcher as object
    rem ----------------------------------------------------------------------
    rem get access to the document
    document   = ThisComponent.CurrentController.Frame
    dispatcher = createUnoService("com.sun.star.frame.DispatchHelper")

    rem ----------------------------------------------------------------------
    dim args1(18) as new com.sun.star.beans.PropertyValue
    args1(0).Name = "SearchItem.StyleFamily"
    args1(0).Value = 2
    args1(1).Name = "SearchItem.CellType"
    args1(1).Value = 0
    args1(2).Name = "SearchItem.RowDirection"
    args1(2).Value = true
    args1(3).Name = "SearchItem.AllTables"
    args1(3).Value = false
    args1(4).Name = "SearchItem.Backward"
    args1(4).Value = false
    args1(5).Name = "SearchItem.Pattern"
    args1(5).Value = false
    args1(6).Name = "SearchItem.Content"
    args1(6).Value = false
    args1(7).Name = "SearchItem.AsianOptions"
    args1(7).Value = false
    args1(8).Name = "SearchItem.AlgorithmType"
    args1(8).Value = 1
    args1(9).Name = "SearchItem.SearchFlags"
    args1(9).Value = 65536
    args1(10).Name = "SearchItem.SearchString"
    args1(10).Value = expression
    args1(11).Name = "SearchItem.ReplaceString"
    args1(11).Value = replacement
    args1(12).Name = "SearchItem.Locale"
    args1(12).Value = 255
    args1(13).Name = "SearchItem.ChangedChars"
    args1(13).Value = 2
    args1(14).Name = "SearchItem.DeletedChars"
    args1(14).Value = 2
    args1(15).Name = "SearchItem.InsertedChars"
    args1(15).Value = 2
    args1(16).Name = "SearchItem.TransliterateFlags"
    args1(16).Value = 1280
    args1(17).Name = "SearchItem.Command"
    args1(17).Value = 3
    args1(18).Name = "Quiet"
    args1(18).Value = true

    dispatcher.executeDispatch(document, ".uno:ExecuteSearch", "", 0, args1())


    end sub

Hope this helps someone

François

Francois
  • 31
  • 2