I am simply trying to create a corpus from Russian, UTF-8 encoded text. The problem is, the Corpus function from the tm package is not encoding the strings correctly.

Here is a reproducible example of my problem:

Load in the Russian text:

> data <- c("Renault Logan, 2005","Складское помещение, 345 м²",
          "Су-шеф","3-к квартира, 64 м², 3/5 эт.","Samsung galaxy S4 mini GT-I9190 (чёрный)")

Create a VectorSource:

> vs <- VectorSource(data)
> vs # outputs correctly

Then, create the corpus:

> corp <- Corpus(vs)
> inspect(corp) # output is not encoded properly

The output that I get is:

> inspect(corp)
<<VCorpus (documents: 5, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
Renault Logan, 2005

[[2]]
<<PlainTextDocument (metadata: 7)>>
Ñêëàäñêîå ïîìåùåíèå, 345 ì<U+00B2>

[[3]]
<<PlainTextDocument (metadata: 7)>>
Ñó-øåô

[[4]]
<<PlainTextDocument (metadata: 7)>>
3-ê êâàðòèðà, 64 ì<U+00B2>, 3/5 ýò.

[[5]]
<<PlainTextDocument (metadata: 7)>>
Samsung galaxy S4 mini GT-I9190 (÷¸ðíûé)

Why does it output incorrectly? There doesn't seem to be any option to set the encoding on the Corpus method. Is there a way to set it after the fact? I have tried this:

> corp <- tm_map(corp, enc2utf8)
Error in FUN(X[[1L]], ...) : argument is not a character vector

But, it errors as shown.

user1477388
  • I cannot replicate. When I run `inspect` it looks the same as it does in `data`. What version of `tm` and R are you using? (`sessionInfo()` should tell you both.) – MrFlick Jul 23 '14 at 20:38
  • @MrFlick I am using `R version 3.1.0 (2014-04-10) Platform: x86_64-w64-mingw32/x64 (64-bit)` and `tm_0.6`. I am on Windows. – user1477388 Jul 23 '14 at 20:41
  • OK. I'm at a computer with 3.0.2 so I can't get the latest version of `tm`. But `Corpus` doesn't have an encoding parameter; `VectorSource` should. What happens with `VectorSource(data, encoding="UTF-8")`? Anything different? – MrFlick Jul 23 '14 at 20:44
  • It says, `Error in VectorSource(raw_test$title, encoding = "UTF-8") : unused argument (encoding = "UTF-8")` – user1477388 Jul 23 '14 at 20:47
  • Ugh. It does seem to be removed from the latest version of `tm` for some reason. One last shot, `VectorSource(enc2utf8(data))`. Maybe I can try on my other computer if that doesn't work. – MrFlick Jul 23 '14 at 20:51
  • Seems to have no effect. – user1477388 Jul 23 '14 at 20:53

3 Answers

Well, there seems to be good news and bad news.

The good news is that the data appears to be fine even if it doesn't display correctly with inspect(). Try looking at

content(corp[[2]])
# [1] "Складское помещение, 345 м²"

The reason it looks funny in inspect() is that the authors changed the way the print.PlainTextDocument function works. It formerly would cat() the value to the screen. Now, however, it feeds the data through writeLines(). That function uses the locale of the system (viewable with Sys.getlocale()) to format the characters/bytes in the document. It turns out Linux and OS X have a proper "UTF-8" locale, but Windows uses language-specific code pages, so characters that aren't in the code page get escaped or translated to funny characters. This means the output should look fine on a Mac or Linux box, but not on a PC.
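A quick base-R sketch (no tm required) illustrates the same point: the underlying bytes of a UTF-8 string are intact regardless of how the console renders them.

```r
# A UTF-8 string keeps its bytes no matter what the console locale is.
# (Assumes this script itself is saved and read as UTF-8.)
x <- "Су-шеф"
Encoding(x)               # "UTF-8" when the source is read as UTF-8
nchar(x, type = "chars")  # 6 characters
nchar(x, type = "bytes")  # 11 bytes: five 2-byte Cyrillic letters plus "-"
charToRaw(x)              # the raw UTF-8 bytes, independent of the locale
```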

Try going a step further and building a DocumentTermMatrix:

dtm <- DocumentTermMatrix(corp)
Terms(dtm)

Hopefully you will see (as I do) the words correctly displayed.

If you like, this article about writing UTF-8 files on Windows has some more information about this OS specific issue. I see no easy way to get writeLines to output UTF-8 to stdout() on Windows. I'm not sure why the package maintainers changed the print method, but one might ask or submit a feature request to change it back.

MrFlick
    Perhaps the maintainer should be notified. +1 – Tyler Rinker Jul 24 '14 at 03:45
  • You're correct, if I create a DTM it appears to be fine. Thank you for your guidance. – user1477388 Jul 24 '14 at 13:10
  • I am having a similar issue. I use `docs<- VCorpus(DirSource(directory = inputdir, encoding ="UTF-8"))`, and my txt files are saved with UTF-8 encoding, but I still get **scal** instead of **fiscal** and **benet** instead of **benefit** when I export the most frequent words and in the topic distributions over words. Is there something I am doing wrong? I would very much appreciate any advice. – Michael Apr 26 '20 at 23:37
  • @Michael I'm not sure what might be going on there. Perhaps the package has changed again. It would be better if you created your own question with a minimal [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) to further help you. – MrFlick Apr 27 '20 at 00:18
  • @MrFlick Thank you for the suggestion. I went ahead and created a new thread. Here is the link in case you can help, thank you! https://stackoverflow.com/questions/61463661/how-to-properly-encode-utf-8-txt-files-for-r-topic-model – Michael Apr 27 '20 at 16:36
I'm surprised this answer has not been posted yet. Don't bother messing with the locale. I'm using tm package version 0.6.0 and it works absolutely fine, provided you add the following little piece of magic:

Encoding(data)  <- "UTF-8"

Well, here is the reproducible code:

data <- c("Renault Logan, 2005","Складское помещение, 345 м²","Су-шеф","3-к квартира, 64 м², 3/5 эт.","Samsung galaxy S4 mini GT-I9190 (чёрный)")

Encoding(data)
# [1] "unknown" "unknown" "unknown" "unknown" "unknown"

Encoding(data) <- "UTF-8"
Encoding(data)
# [1] "unknown" "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"

Just put it in a text file saved with UTF-8 encoding, then source it normally in R. But do not use source.with.encoding(..., encoding = "UTF-8"); it will throw an error.
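If the strings live in a data file rather than a script, the same marking can be done at read time. A minimal sketch, using a temporary file in place of your own data file:

```r
# Round-trip through a UTF-8 text file; readLines() can declare the
# encoding of what it reads. tempfile() stands in for your real data file.
path <- tempfile(fileext = ".txt")
writeLines("Складское помещение, 345 м²", con = file(path, encoding = "UTF-8"))
data <- readLines(path, encoding = "UTF-8")
Encoding(data)  # "UTF-8"
```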

I forgot where I learned this trick; I picked it up somewhere along the way this past week while surfing the Web trying to learn how to process UTF-8 text in R. Things were a lot cleaner in Python (just convert everything to Unicode!). R's approach is much less straightforward for me, and it did not help that the documentation is sparse and confusing.

I had a problem with German UTF-8 encoding while importing the texts. For me, the following one-liner helped:

Sys.setlocale("LC_ALL", "de_DE.UTF-8")

Try running the same with Russian:

Sys.setlocale("LC_ALL", "ru_RU.UTF-8")

Of course, that goes after library(tm) and before creating a corpus.
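Locale names are platform-specific, so it is worth checking first. A hedged sketch (the exact locale strings here are assumptions that depend on your OS):

```r
# Locale names differ by platform: "ru_RU.UTF-8" on Linux/macOS,
# something like "Russian_Russia" on Windows.
Sys.getlocale("LC_CTYPE")                     # check what you have now
ok <- Sys.setlocale("LC_ALL", "ru_RU.UTF-8")  # returns "" with a warning if unavailable
if (ok == "") message("Locale not available on this system")
```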

EugenieSH