
I'm trying to parse some Japanese text, and I can't seem to figure out the output encoding.

This is the output I'm getting:

これは ̾��,����,*,*,*,*,*
本   ̾��,����,*,*,*,*,*
です  ̾��,����,*,*,*,*,*
。   ̾��,������³,*,*,*,*,*
EOS

Steps I took:

  1. git clone https://github.com/taku910/mecab
  2. cd mecab/mecab
  3. ./configure --enable-utf8-only --with-charset=utf8
  4. make
  5. sudo make install
  6. mecab -o ~/Desktop/output.txt ~/Desktop/input.txt, where input.txt contains "これは本です。"

I'm on macOS 10.15.3.

e-e
  • What does `mecab -D` ("show dictionary information and exit") say? In my case, `mecab -D` says it will use IPADIC with `charset: utf8` and I'm curious if in your case default dictionary isn't UTF-8. – Ahmed Fasih Jul 09 '20 at 02:46
  • `charset: euc-jp`, so clearly not what I want – e-e Jul 11 '20 at 03:14
  • Great, so MeCab is fine, just reinstall the dictionary with utf-8. If you have Homebrew, you can just `brew install mecab-unidic` etc. (though they don't have the latest Unidic). – Ahmed Fasih Jul 11 '20 at 16:50
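
For anyone wondering why an `euc-jp` dictionary produces exactly this kind of output, here is a minimal Python sketch of the mechanism. It assumes the dictionary's feature strings were written as EUC-JP bytes and then displayed as UTF-8; the helper name `misread_as_utf8` is made up for illustration and is not part of MeCab. Under that assumption, the IPADIC tag 名詞 ("noun") should come out as the same `̾��` sequence seen at the start of each feature column above, while the surface forms stay readable because they are copied from the UTF-8 input.

```python
# Sketch only: reproduces the garbling under the ASSUMPTION that the
# dictionary emitted EUC-JP bytes which were then read as UTF-8.
# `misread_as_utf8` is a hypothetical helper, not a MeCab function.

def misread_as_utf8(text: str, actual_encoding: str = "euc_jp") -> str:
    """Encode `text` in `actual_encoding`, then decode those bytes as UTF-8.

    Byte sequences that are not valid UTF-8 become U+FFFD, the replacement
    character that renders as a question-mark diamond.
    """
    raw = text.encode(actual_encoding)
    return raw.decode("utf-8", errors="replace")


feature = "名詞"  # first field of an IPADIC feature string ("noun")

# The raw bytes the EUC-JP dictionary would have written for this tag.
print(feature.encode("euc_jp"))

# What those bytes look like when a UTF-8 viewer tries to display them.
garbled = misread_as_utf8(feature)
print(garbled)

# Decoding the same bytes with the correct codec restores the tag.
print(feature.encode("euc_jp").decode("euc_jp"))
```

This only explains the symptom. The fix is still the one from the comments: install a dictionary whose `mecab -D` output reports `charset: utf8`.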
