
If I begin with a wholly Japanese sentence and run it through MeCab, I get something like this:

$ echo "吾輩は猫である" | mecab
吾輩 名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ
は  助詞,係助詞,*,*,*,*,は,ハ,ワ
猫  名詞,一般,*,*,*,*,猫,ネコ,ネコ
で  助動詞,*,*,*,特殊・ダ,連用形,だ,デ,デ
ある 助動詞,*,*,*,五段・ラ行アル,基本形,ある,アル,アル
EOS

If I smash together everything in the last column, I get "ワガハイワネコデアル", which I can then feed into a speech synthesis program to get audio output. That program, however, doesn't handle English words.
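For reference, the concatenation step above can be sketched roughly like this. It works on MeCab's text output held in a string rather than calling MeCab, and the function name `join_last_column` is just something I made up for illustration. It assumes every token line is "surface, whitespace, comma-separated features", as in the output shown.

```python
def join_last_column(mecab_output: str) -> str:
    """Concatenate the final comma-separated field of every token line."""
    pieces = []
    for line in mecab_output.splitlines():
        # Raw MeCab output separates surface and features with a tab;
        # splitting on any whitespace also copes with pasted output.
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue  # skips "EOS" and blank lines
        features = parts[1].strip()
        pieces.append(features.split(",")[-1])
    return "".join(pieces)


SAMPLE = (
    "吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ\n"
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n"
    "猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ\n"
    "で\t助動詞,*,*,*,特殊・ダ,連用形,だ,デ,デ\n"
    "ある\t助動詞,*,*,*,五段・ラ行アル,基本形,ある,アル,アル\n"
    "EOS\n"
)

if __name__ == "__main__":
    print(join_last_column(SAMPLE))
```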

If I throw English into MeCab, it manages to tokenise it (probably naively at the spaces), but gives no reading:

$ echo "I am a cat" | mecab
I   名詞,固有名詞,組織,*,*,*,*
am  名詞,一般,*,*,*,*,*
a   名詞,一般,*,*,*,*,*
cat 名詞,固有名詞,組織,*,*,*,*
EOS

I want to get readings for these as well, even if they're not perfect, so that I can get something along the lines of "アイアムアキャット".
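To make the goal concrete, here is a minimal sketch of the shape I have in mind: use MeCab's reading when there is one, and hand everything else to some fallback. The names `reading_or_none`, `read_with_fallback` and `toy_fallback` are hypothetical, and `TOY_READINGS` is a tiny hand-written table for illustration only, not a real dictionary. It also assumes that an entry has a reading exactly when it carries nine feature fields, which matches the output above but may not hold for other dictionaries.

```python
from typing import Callable, Optional

# Hypothetical, hand-written entries purely for illustration.
TOY_READINGS = {"i": "アイ", "am": "アム", "a": "ア", "cat": "キャット"}


def reading_or_none(features: str) -> Optional[str]:
    """Return the pronunciation field, or None when the entry has none."""
    fields = features.split(",")
    # Assumption: entries with a known reading have 9 feature fields.
    if len(fields) >= 9 and fields[8] != "*":
        return fields[8]
    return None


def read_with_fallback(mecab_output: str, fallback: Callable[[str], str]) -> str:
    """Join MeCab readings, asking `fallback` for tokens that lack one."""
    pieces = []
    for line in mecab_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue  # skips "EOS" and blank lines
        surface, features = parts
        reading = reading_or_none(features.strip())
        pieces.append(reading if reading is not None else fallback(surface))
    return "".join(pieces)


def toy_fallback(surface: str) -> str:
    """Look the word up in the toy table; leave unknown words untouched."""
    return TOY_READINGS.get(surface.lower(), surface)


ENGLISH = (
    "I\t名詞,固有名詞,組織,*,*,*,*\n"
    "am\t名詞,一般,*,*,*,*,*\n"
    "a\t名詞,一般,*,*,*,*,*\n"
    "cat\t名詞,固有名詞,組織,*,*,*,*\n"
    "EOS\n"
)

if __name__ == "__main__":
    print(read_with_fallback(ENGLISH, toy_fallback))
```

The real question is what to put behind `fallback`; the table lookup here only stands in for whatever dictionary or transliteration step turns out to exist.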

I have already scoured the web for solutions. I did find a number of web sites whose transliteration appears to be adequate, but no way to do the same thing in my own code. In a couple of cases I emailed the site authors and have had no response after waiting a few weeks. (Just how far behind on their inboxes are these people?)

There are a number of directions I could go, but I have hit dead ends on all of them so far, so this is my compound question:

  • MeCab takes custom dictionaries. Is there a custom dictionary which fills in the English knowledge somewhat?
  • Is there some other library or tool that can take English and spit out Katakana?
  • Is there some library or tool that can take IPA (International Phonetic Alphabet) and spit out Katakana? (I know how to get from English to IPA.)
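To show what I mean by the IPA route, here is a deliberately crude sketch of the kind of mapping I would otherwise have to write myself. Everything in it is my own assumption rather than the behaviour of any existing tool: the small symbol tables, the epenthetic vowels (o after t/d, u elsewhere), the ッ before a word-final p/t/k that directly follows a single short vowel, and the special case kæ/gæ → キャ/ギャ. It covers only a handful of phonemes and ignores long vowels entirely.

```python
# Column (a, i, u, e, o) for each IPA vowel symbol; deliberately coarse.
VOWEL_COLUMN = {
    "a": 0, "æ": 0, "ʌ": 0, "ə": 0, "ɑ": 0,
    "ɪ": 1, "i": 1,
    "ʊ": 2, "u": 2,
    "ɛ": 3, "e": 3,
    "ɔ": 4, "ɒ": 4, "o": 4,
}
SHORT_VOWELS = set("æɪʊɛʌɒ")
IGNORED = set("ˈˌ.ː")  # stress, syllable and length marks are dropped

# Katakana rows, indexed by the vowel column above.
ROWS = {
    "": "ア イ ウ エ オ".split(),
    "k": "カ キ ク ケ コ".split(),
    "g": "ガ ギ グ ゲ ゴ".split(),
    "s": "サ シ ス セ ソ".split(),
    "t": "タ ティ トゥ テ ト".split(),
    "d": "ダ ディ ドゥ デ ド".split(),
    "n": "ナ ニ ヌ ネ ノ".split(),
    "h": "ハ ヒ フ ヘ ホ".split(),
    "p": "パ ピ プ ペ ポ".split(),
    "b": "バ ビ ブ ベ ボ".split(),
    "m": "マ ミ ム メ モ".split(),
    "r": "ラ リ ル レ ロ".split(),
    "l": "ラ リ ル レ ロ".split(),
}


def ipa_word_to_katakana(word: str) -> str:
    """Convert one IPA-transcribed word using the toy rules above."""
    phones = []
    for ch in word:
        if ch in IGNORED:
            continue
        if ch not in VOWEL_COLUMN and ch not in ROWS:
            raise ValueError(f"unsupported IPA symbol: {ch!r}")
        phones.append(ch)

    out = []
    i = 0
    while i < len(phones):
        p = phones[i]
        nxt = phones[i + 1] if i + 1 < len(phones) else None
        if p in VOWEL_COLUMN:  # bare vowel
            out.append(ROWS[""][VOWEL_COLUMN[p]])
            i += 1
        elif nxt in VOWEL_COLUMN:  # consonant + vowel makes one kana
            if p in "kg" and nxt == "æ":
                out.append("キャ" if p == "k" else "ギャ")
            else:
                out.append(ROWS[p][VOWEL_COLUMN[nxt]])
            i += 2
        elif p == "n":  # n without a following vowel
            out.append("ン")
            i += 1
        else:  # any other consonant gets an epenthetic vowel
            after_single_short_vowel = (
                i >= 1
                and phones[i - 1] in SHORT_VOWELS
                and (i < 2 or phones[i - 2] not in VOWEL_COLUMN)
            )
            if nxt is None and p in "ptk" and after_single_short_vowel:
                out.append("ッ")
            out.append(ROWS[p][4 if p in "td" else 2])
            i += 1
    return "".join(out)


def ipa_to_katakana(text: str) -> str:
    """Convert whitespace-separated IPA words and join the results."""
    return "".join(ipa_word_to_katakana(w) for w in text.split())


if __name__ == "__main__":
    print(ipa_to_katakana("aɪ æm ə kæt"))
```

Even this much needs several special cases for four words, which is why I would much rather find an existing library than grow a rule table like this by hand.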

As an aside, I find that the software "VOICEROID" can speak English text (poorly, but adequately for my purposes). This software uses MeCab too (or at least its DLL and dictionary files are included in the install). It also uses another library, Cabocha, which as far as I can tell from running it does exactly the same thing as MeCab. It could be using custom dictionaries for either of these two libraries to do the job, or the code to do it could be in the proprietary AITalk library they are using. More research is needed, and I haven't yet figured out how to run either tool against their dictionaries to test this directly.

Hakanai
  • (1) MeCab treats spaces as stopwords—I'm trying to find a source on this but am failing at the moment. But try putting a space between, say, 吾 and 輩, and you'll see MeCab make this two morphemes. So that's why your English is getting "parsed". – Ahmed Fasih Nov 02 '15 at 14:17
  • (2) English pronunciation is so crazy, but I have used English-to-katakana converters, e.g., http://www.sljfaq.org/cgi/e2k.cgi before. But I think they work using dictionaries (Japanese→English & vice versa), not on any kind of phonetic magic. Is this one of the sites you've contacted? Ben Bullock (of sljfaq.org) is more responsive on the sljfaq mailing list than direct email. – Ahmed Fasih Nov 02 '15 at 14:21
  • @AhmedFasih That is indeed one of the guys I mailed. I know very vaguely that it works using an English to IPA dictionary followed by rules to convert IPA to Katakana. It then has some additional fallback logic for words that aren't in the dictionary. A lot of that is detailed in posts to the mailing list, just not the meat of it, like the location of sources. His GitHub account also has just the issues for the site and none of the code, making me think that hiding the source is quite deliberate. – Hakanai Nov 03 '15 at 05:16
  • Mailed to the faq site mailing list anyway. I guess we'll see. Otherwise I have one hell of a task ahead of me digging up a dictionary and reimplementing all that mapping stuff. :( – Hakanai Nov 03 '15 at 05:33
  • Oh well, he said no. – Hakanai Nov 05 '15 at 00:32
  • I think you can get an English dictionary with pronunciations, like `window:['wɪndəʊ]`. Mapping from pronunciation to katakana is much easier, I think. – Mithril Mar 03 '16 at 06:42
  • @Mithril This is the direction I'm currently going. Trying to use MaryTTS to get pronunciation → map that pronunciation to romaji (hard, lots of edge cases) → transliterate to katakana (easy, ICU). One of the problems is that the pronunciation dictionary is missing many entries (it at least has the smarts to guess it from similar entries, though) and some of the ones it does have come out as dubious pronunciations. The other main problem is that things like schwa usually map to more than one sound and MaryTTS doesn't tell me the text that mapped to a syllable so I can't see the original vowel. – Hakanai Mar 04 '16 at 04:17

0 Answers