If I begin with a wholly Japanese sentence and run it through MeCab, I get something like this:
$ echo "吾輩は猫である" | mecab
吾輩 名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
猫 名詞,一般,*,*,*,*,猫,ネコ,ネコ
で 助動詞,*,*,*,特殊・ダ,連用形,だ,デ,デ
ある 助動詞,*,*,*,五段・ラ行アル,基本形,ある,アル,アル
EOS
If I smash together everything I get from the last column, I get "ワガハイワネコデアル", which I can then feed into a speech synthesis program and get output. Said program, however, doesn't handle English words.
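To make the "smash together" step concrete, here is a minimal Python sketch. It assumes the default IPADIC-style output shown above, where a known word carries nine comma-separated feature fields and the last one is the pronunciation. The function name `join_pronunciations` is my own and not part of MeCab, and the sketch parses captured `mecab` output rather than calling the library.

```python
def join_pronunciations(mecab_output: str) -> str:
    """Concatenate the pronunciation field of every token in MeCab's output."""
    parts = []
    for line in mecab_output.splitlines():
        line = line.strip()
        if not line or line == "EOS":
            continue  # skip blank lines and the end-of-sentence marker
        # Split the surface form from the feature list on the first run of
        # whitespace (MeCab itself emits a tab; the listing above shows a space).
        pieces = line.split(None, 1)
        if len(pieces) < 2:
            continue  # no feature list on this line
        fields = pieces[1].split(",")
        # Assumption: IPADIC layout, pronunciation is the 9th feature field.
        if len(fields) >= 9:
            parts.append(fields[8])
    return "".join(parts)


sample = """吾輩 名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
猫 名詞,一般,*,*,*,*,猫,ネコ,ネコ
で 助動詞,*,*,*,特殊・ダ,連用形,だ,デ,デ
ある 助動詞,*,*,*,五段・ラ行アル,基本形,ある,アル,アル
EOS"""
print(join_pronunciations(sample))
```

Tokens without a pronunciation field are silently dropped here, which is exactly the problem with English input described next.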
If I throw English into MeCab, it manages to tokenise it (probably naively at the spaces) but gives no reading:
$ echo "I am a cat" | mecab
I 名詞,固有名詞,組織,*,*,*,*
am 名詞,一般,*,*,*,*,*
a 名詞,一般,*,*,*,*,*
cat 名詞,固有名詞,組織,*,*,*,*
EOS
I want to get readings for these as well, even if they're not perfect, so that I can get something along the lines of "アイアムアキャット".
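As a stopgap, the behaviour I am after could be sketched like this in Python. It only shows the shape of the fallback and solves nothing by itself: `FALLBACK_READINGS` is a hypothetical hand-written table that stands in for whatever dictionary or transliteration tool I am looking for, and the nine-field check assumes the IPADIC output format shown above.

```python
# Hypothetical hand-written table; a real solution needs a proper
# English-to-Katakana source instead of four hard-coded words.
FALLBACK_READINGS = {"i": "アイ", "am": "アム", "a": "ア", "cat": "キャット"}


def reading_for(line: str) -> str:
    """Return a Katakana reading for one line of MeCab output."""
    pieces = line.split(None, 1)  # surface form, then the feature list
    surface = pieces[0]
    fields = pieces[1].split(",") if len(pieces) > 1 else []
    # Known words (IPADIC layout assumed): the 9th field is the pronunciation.
    if len(fields) >= 9 and fields[8] != "*":
        return fields[8]
    # Unknown words: try the fallback table, else pass the surface through.
    return FALLBACK_READINGS.get(surface.lower(), surface)


def to_katakana(mecab_output: str) -> str:
    """Join the readings of all tokens, skipping blanks and the EOS marker."""
    lines = (raw.strip() for raw in mecab_output.splitlines())
    return "".join(reading_for(l) for l in lines if l and l != "EOS")


english = """I 名詞,固有名詞,組織,*,*,*,*
am 名詞,一般,*,*,*,*,*
a 名詞,一般,*,*,*,*,*
cat 名詞,固有名詞,組織,*,*,*,*
EOS"""
print(to_katakana(english))
```

Words missing from the table come through unchanged, so the speech synthesiser would still choke on them. That is why I am asking for a real data source.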
I have already scoured the web for solutions. I did find a bunch of web sites whose transliteration appears to be adequate, but I can't find any way to do it in my own code. In a couple of cases I emailed the site authors and have had no response after waiting a few weeks. (Just how far behind on their inboxes are these people?)
There are a number of directions I could go in, but so far I have hit dead ends on all of them, so this is my compound question:
- MeCab takes custom dictionaries. Is there a custom dictionary which fills in the English knowledge somewhat?
- Is there some other library or tool that can take English and spit out Katakana?
- Is there some library or tool that can take IPA (International Phonetic Alphabet) and spit out Katakana? (I know how to get from English to IPA.)
As an aside, I found that the software "VOICEROID" can speak English text (poorly, but adequately for my purposes). This software uses MeCab too (or at least its DLL and dictionary files are included in the install). It also ships another library, CaboCha, which, as far as I can tell from running it, does exactly the same thing as MeCab. It could be using custom dictionaries for either of these two libraries to do the job, or the code to do it could be in the proprietary AITalk library they are using. I haven't yet figured out how to run either tool against their dictionaries to test this directly, so more research is needed.