I'm using a .NET port of MeCab (called NMeCab) to try to convert Japanese hiragana, katakana, and kanji to romaji.

Here's my code:

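using NMeCab;

MeCabTagger _tagger;

public string Parse(string input)
{
    // Create a tagger with the default settings.
    _tagger = MeCabTagger.Create();
    _tagger.OutPutFormatType = "lattice";
    _tagger.LatticeLevel = MeCabLatticeLevel.Two;

    // The lattice format gives one line per morpheme (surface form plus feature fields), then "EOS".
    var output = _tagger.Parse(input);

    return output;
}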

When I call Parse(input) using the following Japanese text: "ども"

I get the output: "ども助詞,接続助詞,,,,,ども,ドモ,ドモ EOS"

I'm looking for the romaji of "ども", which would be "domo".

I've tried to use MeCab directly as discussed in this SO answer, but I get the same output.

Chaddeus

1 Answer

To my knowledge, none of the dictionaries commonly used with MeCab (IPA, Jumandic, or UniDic) includes a romaji transcription of words. And there is actually no need for one:

  1. There are several competing transcription schemes (e.g. Hepburn, Kunrei-shiki, 99-shiki);

  2. Information on the pronunciation of lexical units is already available (e.g. ドモ).
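As a rough illustration of the second point, the reading can be pulled out of the comma-separated feature string. This is only a sketch: `FeatureFields` and `GetReading` are made-up names, and it assumes the usual IPA dictionary layout, where unused fields are written as `*` and the katakana reading is the eighth field (index 7). Other dictionaries order their fields differently, and a plain `Split` does not handle quoted fields.

```csharp
using System;

public static class FeatureFields
{
    // Assumption: IPA dictionary field order, where index 7 holds the katakana reading.
    private const int ReadingIndex = 7;

    // Returns the reading field of a feature string, or an empty string when
    // the entry has no reading (too few fields, or the field is "*").
    public static string GetReading(string feature)
    {
        string[] fields = feature.Split(',');
        if (fields.Length <= ReadingIndex || fields[ReadingIndex] == "*")
        {
            return "";
        }
        return fields[ReadingIndex];
    }
}
```

With the features from the question, `FeatureFields.GetReading("助詞,接続助詞,*,*,*,*,ども,ドモ,ドモ")` returns "ドモ".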

You have to write your own transcription routine, or look for an existing katakana-to-romaji transcription module that supports the scheme you want.
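If you roll your own, a minimal sketch of such a routine could look like the following. It is not part of NMeCab; `KanaRomanizer` and `ToRomaji` are hypothetical names, and the mapping is a simplified Hepburn-style one. It covers the basic kana, voiced kana, the ャ/ュ/ョ digraphs, the small ッ, and the long-vowel mark (rendered here by repeating the vowel, which is only one of several conventions). Hiragana is shifted to katakana first. It does not handle ン before a vowel or extended katakana such as ファ; unknown characters are passed through unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class KanaRomanizer
{
    // The two lists are parallel: the n-th kana maps to the n-th romaji syllable.
    private const string Kana =
        "アイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラリルレロワヲン" +
        "ガギグゲゴザジズゼゾダヂヅデドバビブベボパピプペポ";
    private const string Romaji =
        "a i u e o ka ki ku ke ko sa shi su se so ta chi tsu te to na ni nu ne no " +
        "ha hi fu he ho ma mi mu me mo ya yu yo ra ri ru re ro wa o n " +
        "ga gi gu ge go za ji zu ze zo da ji zu de do ba bi bu be bo pa pi pu pe po";

    private static readonly Dictionary<char, string> Syllables = BuildTable();

    private static Dictionary<char, string> BuildTable()
    {
        string[] parts = Romaji.Split(' ');
        var table = new Dictionary<char, string>();
        for (int i = 0; i < Kana.Length; i++)
        {
            table[Kana[i]] = parts[i];
        }
        return table;
    }

    // Hiragana (U+3041..U+3096) sits exactly 0x60 below the matching katakana.
    private static char ToKatakana(char c)
    {
        return (c >= '\u3041' && c <= '\u3096') ? (char)(c + 0x60) : c;
    }

    public static string ToRomaji(string kana)
    {
        var sb = new StringBuilder();
        bool doubleNext = false; // set after a small ッ
        for (int i = 0; i < kana.Length; i++)
        {
            char c = ToKatakana(kana[i]);
            if (c == 'ッ') { doubleNext = true; continue; }
            if (c == 'ー')
            {
                // Long-vowel mark: repeat the preceding vowel, if there is one.
                if (sb.Length > 0 && "aiueo".IndexOf(sb[sb.Length - 1]) >= 0)
                {
                    sb.Append(sb[sb.Length - 1]);
                }
                continue;
            }
            string syllable;
            if (!Syllables.TryGetValue(c, out syllable))
            {
                sb.Append(kana[i]); // not in the table: pass through unchanged
                doubleNext = false;
                continue;
            }
            // Digraph: an i-row kana followed by small ャ, ュ or ョ.
            if (i + 1 < kana.Length && syllable.Length > 1 && syllable[syllable.Length - 1] == 'i')
            {
                int glide = "ャュョ".IndexOf(ToKatakana(kana[i + 1]));
                if (glide >= 0)
                {
                    string stem = syllable.Substring(0, syllable.Length - 1);
                    bool palatal = stem == "sh" || stem == "ch" || stem == "j";
                    syllable = stem + (palatal ? "" : "y") + "auo"[glide];
                    i++; // the small kana has been consumed
                }
            }
            if (doubleNext)
            {
                // Hepburn writes a doubled "ch" as "tch" (e.g. "matcha").
                sb.Append(syllable[0] == 'c' ? 't' : syllable[0]);
                doubleNext = false;
            }
            sb.Append(syllable);
        }
        return sb.ToString();
    }
}
```

With this sketch, `KanaRomanizer.ToRomaji("ドモ")` and `KanaRomanizer.ToRomaji("ども")` both give "domo", "キャット" gives "kyatto", and "マッチャ" gives "matcha". Switching to Kunrei-shiki would mostly mean changing the romaji list and the digraph rule.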

Pierre
    Gotcha. Thanks... thought MeCab handled the romaji translation. Instead it looks like it simply converts kanji down to hiragana/katakana. Then I just roll my own hiragana/katakana conversion. – Chaddeus Jun 01 '14 at 22:34
    Actually the hiragana/katakana transcription is part of the dictionary... you can have a look at the IPA dictionary source files (*.csv). – Pierre Jun 02 '14 at 09:47