Can't figure out output character encoding for MeCab

Asked Jul 07 '20 at 06:38

I'm trying to parse some Japanese text with MeCab, but the feature columns in the output are garbled and I can't figure out which encoding they are in.

This is the output I'm getting:

これは ̾��,����,*,*,*,*,*
本   ̾��,����,*,*,*,*,*
です  ̾��,����,*,*,*,*,*
。   ̾��,������³,*,*,*,*,*
EOS

Steps I took:

git clone https://github.com/taku910/mecab
cd mecab/mecab
./configure --enable-utf8-only --with-charset=utf8
make
sudo make install
mecab -o ~/Desktop/output.txt ~/Desktop/input.txt (where input.txt contains "これは本です。")

I'm on macOS 10.15.3.

e-e

What does `mecab -D` ("show dictionary information and exit") say? In my case, `mecab -D` says it will use IPADIC with `charset: utf8`, and I'm curious whether your default dictionary is something other than UTF-8. – Ahmed Fasih Jul 09 '20 at 02:46
`charset: euc-jp`, so clearly not what I want – e-e Jul 11 '20 at 03:14
Great, so MeCab itself is fine; you just need to reinstall the dictionary built for UTF-8. If you have Homebrew, you can just `brew install mecab-unidic` etc. (though Homebrew doesn't have the latest UniDic). – Ahmed Fasih Jul 11 '20 at 16:50
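
For anyone who wants to confirm the `charset: euc-jp` diagnosis without touching the MeCab install, here is a minimal Python sketch. It assumes the garbled feature string was the tag `名詞,一般,*,*,*,*,*`; that value is a guess based on the field layout in the question, not something taken from the asker's machine. The sketch encodes the tag as EUC-JP and then misreads the bytes as UTF-8, which is roughly what a UTF-8 terminal or editor does with the output file.

```python
# Assumed feature string (hypothetical: inferred from the shape of the
# garbled output in the question, not confirmed by the asker).
feature = "名詞,一般,*,*,*,*,*"

# Bytes as a EUC-JP dictionary would write them.
raw = feature.encode("euc_jp")

# Misreading those bytes as UTF-8: invalid sequences become U+FFFD,
# the replacement character that shows up as a question-mark glyph.
garbled = raw.decode("utf-8", errors="replace")

# Decoding with the dictionary's actual charset restores the text.
recovered = raw.decode("euc_jp")

print(repr(garbled))
print(recovered)
```

If the printed garbage has the same shape as the feature columns in the question (a combining mark followed by replacement characters), the dictionary charset is the likely culprit. Note that the surface forms (これは, 本, ...) are displayed correctly in the question's output while the features are not, so the file appears to mix encodings; re-decoding the whole file is probably not a clean workaround, and installing a UTF-8 dictionary as suggested in the comments remains the real fix.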

0 Answers