
I am new to R and working through a lot of tutorials; right now I am working on word clouds.

I suffer from a common R encoding disease: UTF-8 text is not displayed as expected.

I am trying to create a word cloud from a large body of text in a .txt file (in Ukrainian, UTF-8 encoded), and my cloud comes out completely wrong :(.

My code, the part where I state the encoding:

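library(tm)  # provides Corpus(), VectorSource() and inspect()

text <- readLines(file.choose())
Encoding(text) <- "UTF-8"  # mark the strings as UTF-8 after reading
docs <- Corpus(VectorSource(text))
inspect(docs)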

The text is displayed as expected in the console (in Ukrainian, with all special symbols).
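
A minimal sketch of another way to declare the encoding: pass `encoding = "UTF-8"` to `readLines()` at read time and check the result with `validUTF8()`. The temporary file and the two sample words are hypothetical stand-ins for the real .txt file, and it is not confirmed that this changes what happens later in the pipeline:

```r
# Hypothetical sample file standing in for the real .txt (assumption)
path <- tempfile(fileext = ".txt")
sample_words <- c("\u043d\u0430", "\u0441\u043b\u043e\u0432\u043e")  # "на", "слово"

# Write the words as raw UTF-8 bytes so the file is UTF-8 in any locale
con <- file(path, open = "wb")
writeLines(enc2utf8(sample_words), con, useBytes = TRUE)
close(con)

# Declare the encoding while reading instead of re-marking afterwards
text <- readLines(path, encoding = "UTF-8")

print(Encoding(text))   # how R has marked each string
print(validUTF8(text))  # whether the bytes form valid UTF-8
```

If `validUTF8()` returned `FALSE` for the real file, the file itself would not be UTF-8, which would point to a different problem than the one described here.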

However, when I create a matrix and then a data frame, the output has the wrong encoding:

dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)

What I see in the console:

> head(d, 10)
    word freq
РЅР  РЅР 1856
СЃС  СЃС 1668
СЂР  СЂР 1576
РЅС  РЅС 1162
РІР  РІР 1119
РґР  РґР 1112
РјР  РјР  994
РѕР  РѕР  857
РєС  РєС  809
РёС  РёС  788
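
The garbled words look like UTF-8 bytes that were decoded as Windows-1251 somewhere along the way. The sketch below reproduces that pattern for a single word. The word "на" is an assumed example (it is not taken from the file), and this is an observation about the symptom, not a confirmed diagnosis of where the re-decoding happens:

```r
word <- "\u043d\u0430"                    # "на", an assumed sample word
utf8_bytes <- charToRaw(enc2utf8(word))   # the UTF-8 bytes: d0 bd d0 b0

# Reinterpret those same bytes as Windows-1251 and convert back to UTF-8
garbled <- iconv(rawToChar(utf8_bytes), from = "CP1251", to = "UTF-8")

print(utf8_bytes)
print(garbled)  # begins with "РЅР", like the first row of the output above
```

If this is what is happening, the strings are being treated as the native Windows code page at some step after the corpus is built, even though they were marked as UTF-8 when read.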

I tried changing the locale and a few other things I found on Stack Overflow, but nothing seems to work.

What could be the problem? What am I missing?

Thanks!

10tons