
I'm working with Unicode text in R using the text mining package tm. I'd like Unicode characters not to be destroyed when they're read into the program, but I can't find the missing keyword. Here's an example of Unicode text that gets mangled instantly upon being read as a corpus:

library(tm)
u <- VectorSource("The great Chāṇakya (350–283 BC).", encoding = "UTF-8")
v <- Corpus(u)
inspect(v)
## [[1]]
## The great Chaṇakya (350–283 BC).  <-- the ā has been coerced to "a"

writeCorpus(v, 'test.txt')
## yields: The great Cha<U+1E47>akya (350–283 BC).

I've tried using UTF-16 as well, with the same results. Is there a way to pass this text through tm without having it destroyed?

Michael K
  • One way would be to save it in a UTF-8-encoded text file and read that instead of the copy/pasted string, e.g. `inspect(Corpus(VectorSource(readLines("my.txt", n=1, encoding="UTF-8"))))`. This produces the correct output on my Windows machine. – lukeA Feb 21 '14 at 04:15
  • Hmm, I get the correct output for inspect(), but when I try to write back, I get the same output as in the question. It's a step, though, and I'll see if I can get the rest of the way. Thanks! – Michael K Feb 21 '14 at 05:56
  • So, writing to a file appears to be busted. However, the solution to http://stackoverflow.com/questions/10675360/utf-8-file-output-in-r provides a way to write utf-8 encoded text to a file properly. – Michael K Feb 21 '14 at 06:07
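Building on the linked answer, the workaround for the write-back step is to skip `writeCorpus()` (which writes in the native locale encoding) and push the text through an explicit UTF-8 file connection instead. This is a minimal sketch, not a verified fix for every tm version; the `readLines()` round-trip at the end is just a sanity check:

```r
# Write the text through a connection opened with an explicit UTF-8
# encoding, rather than letting writeCorpus() use the native locale.
txt <- "The great Chāṇakya (350–283 BC)."

con <- file("test.txt", open = "w", encoding = "UTF-8")
writeLines(txt, con)
close(con)

# Sanity check: declare the encoding when reading the file back,
# so the bytes are interpreted as UTF-8 rather than the locale default.
readLines("test.txt", encoding = "UTF-8")
```

On Windows in particular, the explicit `encoding = "UTF-8"` on the connection is what prevents the `<U+1E47>` escapes from appearing in the output file.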

0 Answers