I've scraped Japanese content from the web for a content analysis. I am now preparing the text data, starting with building a term-document matrix. The package I am using to clean and parse the text is "RMeCab". I've been told that this package requires the text data to be in ANSI encoding. But my data is in UTF-8, as are the RMeCab setting and R's global encoding setting.

Do I need to change the encoding of my text files in order to run RMeCab? If so, how can I quickly convert the encoding of tens of thousands of separate text files?

I tried encoding-conversion websites, which gave me gibberish as the ANSI output. I do not understand how feeding something that looks like a string of question marks into RMeCab could work. If I successfully convert the files to ANSI and the text looks like a jumble of symbols, will RMeCab still be able to read it as Japanese text?

IYP
  • MeCab can be compiled for either UTF-8 or Shift-JIS (most likely what "ANSI" means here; see http://stackoverflow.com/a/8468126/500207: 'ANSI is MS terminology for "whatever the default legacy encoding is on this computer"', which is CP932, aka SJIS, for the Japanese locale). If your RMeCab is set to UTF-8, then I suspect it is using a UTF-8-compiled version of MeCab under the hood, and so will work just fine with UTF-8 text files. If you really need to convert encodings, the `iconv` command-line utility is your encodings Swiss-army chainsaw. – Ahmed Fasih Nov 08 '14 at 08:17
  • Please tell us your operating system. – Ahmed Fasih Nov 08 '14 at 08:21
  • Actually it looks like `iconv` is also available as a function in R: http://stackoverflow.com/a/7482255/500207 (again, only needed if your MeCab is SJIS-compiled, which I can double-check as soon as I know your OS) – Ahmed Fasih Nov 08 '14 at 08:23
  • Thank you so much!!! I converted the encoding using writeLines(). – IYP Nov 19 '14 at 08:15
  • Glad you got it! Can you post this as an answer and accept it? Happy NLP'ing! – Ahmed Fasih Dec 26 '14 at 19:19
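
For anyone landing here later: the asker did not post the code they ended up with, so the following is only a minimal sketch of how a batch conversion with `readLines()`, `iconv()` and `writeLines()` might look, and only matters if your MeCab really is SJIS-compiled. The function name `convert_dir` and the directory arguments are hypothetical, and it assumes that your platform's iconv knows the name "CP932" (you can check with `iconvlist()`).

```r
# Sketch: convert every .txt file in `in_dir` from UTF-8 to CP932 (Shift-JIS)
# and write the result under the same file name in `out_dir`.
# `convert_dir`, `in_dir` and `out_dir` are hypothetical names for illustration.
convert_dir <- function(in_dir, out_dir, from = "UTF-8", to = "CP932") {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  files <- list.files(in_dir, pattern = "\\.txt$", full.names = TRUE)
  n_written <- 0L
  for (f in files) {
    lines <- readLines(f, encoding = "UTF-8", warn = FALSE)
    converted <- iconv(lines, from = from, to = to)
    # iconv() returns NA for a line it cannot convert; skip such files
    # instead of silently writing the literal text "NA".
    if (anyNA(converted)) {
      warning("Skipping file with unconvertible text: ", f)
      next
    }
    # Binary mode plus useBytes = TRUE writes the converted bytes untouched.
    con <- file(file.path(out_dir, basename(f)), open = "wb")
    writeLines(converted, con, useBytes = TRUE)
    close(con)
    n_written <- n_written + 1L
  }
  invisible(n_written)
}
```

Calling something like `convert_dir("corpus_utf8", "corpus_sjis")` (both paths made up) leaves the originals untouched, which makes it easy to compare RMeCab's output on both versions. The converted files will look like gibberish in any viewer that assumes UTF-8; that is expected, since the bytes are only meaningful when read as CP932.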

0 Answers