R tm package: utf-8 text

Question

I would like to create a wordcloud for non-english text in utf-8 (actually, it's in kazakh language).

The text is displayed absolutely right in inspect function of the tm package. However, when I search for word frequency everything is displayed incorrectly:

The problem is that the text is displayed with coded characters instead of words. Cyrillic characters are displayed correctly. Consquently the wordcloud becomes a complete mess.

Is it possible to assign encoding to the tm function somehow? I tried this, but the text on its own is fine, the problem is with using tm package.

Let a sample text be:

Ол арман – әлем елдерімен терезесі тең қатынас құрып, әлем картасынан ойып тұрып орын алатын Тәуелсіз Мемлекет атану еді. Ол арман – тұрмысы бақуатты, түтіні түзу ұшқан, ұрпағы ертеңіне сеніммен қарайтын бақытты Ел болу еді. Біз армандарды ақиқатқа айналдырдық. Мәңгілік Елдің іргетасын қаладық. Мен қоғамда «Қазақ елінің ұлттық идеясы қандай болуы керек?» деген сауал жиі талқыға түсетінін көріп жүрмін. Біз үшін болашағымызға бағдар ететін, ұлтты ұйыстырып, ұлы мақсаттарға жетелейтін идея бар. Ол – Мәңгілік Ел идеясы. Тәуелсіздікпен бірге халқымыз Мәңгілік Мұраттарына қол жеткізді.

My simple code is this: (Based on onertipaday.blogspot.com tutorials:)

require(tm)
require(wordcloud)
text<-readLines("text.txt", encoding="UTF-8")
ap.corpus <- Corpus(DataframeSource(data.frame(text)))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
ap.corpus <- tm_map(ap.corpus, tolower)
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
ap.v <- sort(rowSums(ap.m),decreasing=TRUE)
ap.d <- data.frame(word = names(ap.v),freq=ap.v)
table(ap.d$freq)

1  2 
44  4 

findFreqTerms(ap.tdm, lowfreq=2)

[1] "<U+04D9>лем"            "арман"                  "еді"                   
[4] "м<U+04D9><U+04A3>гілік"

Those words should be: "Әлем", арман", "еді", "мәңгілік". They are displayed correctly in inspect(ap.corpus) output.

Highly appreciate any help! :)

score 7 · Accepted Answer · answered Jan 21 '14 at 07:52

7

The problem comes from the default tokenizer. tm by default uses scan_tokenizer which it looses encoding(maybe you should contact the maintainer to add an encoding argument).

scan_tokenizer function (x) { scan(text = x, what = "character", quote = "", quiet = TRUE) }

One solution is to provide your own tokenizer to create the matrix terms. I am using strsplit:

scanner <- function(x) strsplit(x," ")
ap.tdm <- TermDocumentMatrix(ap.corpus,control=list(tokenize=scanner))

Then you get the result well encoded:

findFreqTerms(ap.tdm, lowfreq=2)
[1] "арман"    "біз"      "еді"      "әлем"     "идеясы"   "мәңгілік"

answered Jan 21 '14 at 07:52

agstudy

119,832
17
199
261

@Asayat you are welcome! just I am curious , what is the text in the question is about? very hard to translate kazakh language! – agstudy Jan 21 '14 at 08:00
1

it's a part of the annual speech of our president. That paragraph states about our nation's dream to be a part of free and developed world to come true, specied with too many fancy words :) awkwardly, in the end he says that our main national idea should be somekind of "Eternal Country", it sounds weird if you take in mind that we had only one president since our independence and he might mean some kind of different eternity :) – Asayat Jan 21 '14 at 08:16
1

Just wanted to analyse annual speeches in different period of time and see the main point of his ideas :) It's also interesting to compare world cloulds of the same text but in different languages, in this case russian and kazakh – Asayat Jan 21 '14 at 08:22

score 5 · Answer 2 · edited May 23 '17 at 12:34

Actually, I disagree with agstudy's answer. It does not seem to be a tokenizer problem. I'm using version 0.6.0 of the tm package and your code works just fine for me, except that I had to explicitly set the encoding of your text data to UTF-8 using :

Encoding(text)  <- "UTF-8"

Below is the complete piece of reproducible code. Just make sure you save it in a file with UTF-8 encoding, and use source() to run it; do not use source.with.encoding(), it'll throw an error.

text  <- "Ол арман – әлем елдерімен терезесі тең қатынас құрып, әлем картасынан ойып тұрып орын алатын Тәуелсіз Мемлекет атану еді. Ол арман – тұрмысы бақуатты, түтіні түзу ұшқан, ұрпағы ертеңіне сеніммен қарайтын бақытты Ел болу еді. Біз армандарды ақиқатқа айналдырдық. Мәңгілік Елдің іргетасын қаладық. Мен қоғамда «Қазақ елінің ұлттық идеясы қандай болуы керек?» деген сауал жиі талқыға түсетінін көріп жүрмін. Біз үшін болашағымызға бағдар ететін, ұлтты ұйыстырып, ұлы мақсаттарға жетелейтін идея бар. Ол – Мәңгілік Ел идеясы. Тәуелсіздікпен бірге халқымыз Мәңгілік Мұраттарына қол жеткізді."

Encoding(text)
# [1] "unknown"

Encoding(text)  <- "UTF-8"
# [1] "UTF-8"

ap.corpus <- Corpus(DataframeSource(data.frame(text)))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
ap.corpus <- tm_map(ap.corpus, content_transformer(tolower))

content(ap.corpus[[1]])

ap.tdm <- TermDocumentMatrix(ap.corpus)

ap.m <- as.matrix(ap.tdm)
ap.v <- sort(rowSums(ap.m),decreasing=TRUE)
ap.d <- data.frame(word = names(ap.v),freq=ap.v)

print(table(ap.d$freq))
# 1  2  3 
# 62  5  1 

print(findFreqTerms(ap.tdm, lowfreq=2))
# [1] "арман"    "біз"      "еді"      "әлем"     "идеясы"   "мәңгілік"

It worked for me, hope it does for you too.

R tm package: utf-8 text

2 Answers2

Linked