
I am working on text mining in R with Arabic text, and I have a problem getting RStudio to handle the Arabic language. I set the locale to Arabic as shown here:

Sys.setlocale("LC_CTYPE","arabic")

After that the Arabic text is displayed and I can read it, but when I try to calculate the word frequencies, the Arabic is no longer recognized and the words are converted to symbols.
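Since the console display and the stored bytes can differ, it may help to inspect what R actually holds for a string. The following is only a minimal sketch in base R; the sample string `x` is a hypothetical stand-in written with `\u` escapes so that it does not depend on how the script file itself is encoded.

```r
# Hypothetical sample: the word "اليمن" written with Unicode escapes
x <- "\u0627\u0644\u064a\u0645\u0646"

# Declared encoding mark of the string ("UTF-8", "latin1", "bytes" or "unknown")
enc_mark <- Encoding(x)

# Convert to UTF-8 explicitly (a no-op if the string is already UTF-8)
x_utf8 <- enc2utf8(x)

# Raw bytes: each Arabic letter in this word takes two bytes in UTF-8
x_bytes <- charToRaw(x_utf8)

# Check that the bytes form valid UTF-8
is_valid <- validUTF8(x_utf8)

print(enc_mark)
print(x_bytes)
print(is_valid)
```

Running the same checks on a few elements of `data$text` would show whether the strings are marked as UTF-8 before they are passed on to the corpus.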

Here is my code and a sample of the data.

The data:

> head(data)
                                                                                            text joy anger
1 احاطه مجلس امن اليمن يوم مهمه لغايه يجب تكون اجهزه امم متحده واضحه تجاه تسويف حوثي تزامه انسحا   2     0
2                                           فارسلنا طوفان جراد قمل ضفادع دم ايات مفصل حشرات بكمي   0     0
3          امار تمنع سفرالمسؤل يمنين اراضيهالامن ترتضيه لاجل مصلحه وبينما تطيق يمني مطاراتها وقت   0     0
4                                                       عز تاج يفتخر راس اليمن وفخر ارض مشي يمني   2     0
5                                                   اقسم عظيم تحارب اقسم عظيم سعوديه تحافظا حوثي   2     0
6                                                      قرقاش احاطه مجلس امن اليمن يوم مهمه لغايه   1     0

The code:

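library(tm)  # provides Corpus, VectorSource and TermDocumentMatrix

# one document per emotion: all anger tweets first, then all joy tweets
emotion_tweet = c(
  paste(data$text[data$anger > 0], collapse=" "),
  paste(data$text[data$joy > 0], collapse=" "))
# create corpus
corpus = Corpus(VectorSource(emotion_tweet))
# create term-document matrix
tdm = TermDocumentMatrix(corpus)
tdm = as.matrix(tdm)  # convert to a plain matrix (terms x emotions)
# name the columns after the emotions, in the same order as emotion_tweet
colnames(tdm) = c('anger','joy')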

In the resulting tdm, all the terms are symbols that I cannot read:

> head(tdm)
         Docs
Terms     anger  joy
  طھط      4933 6115
  طھظ      2716 3039
  طھظپ       12   18
  طھظپط     411  418
  طھظپطھ      1    3
  طھظپطھط     4    2
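The symbols in the output look like what appears when UTF-8 bytes are decoded as Windows-1256. The sketch below is only an illustration of that assumption, not a confirmed diagnosis: it takes a hypothetical two-letter term, reads its UTF-8 bytes as CP1256, and ends up with the same four characters as the third row above (`طھظپ`). The encoding name `"CP1256"` is assumed to be supported by the local `iconv` build.

```r
# Hypothetical term "تف", written with escapes to avoid file-encoding issues
term <- "\u062a\u0641"

# Its UTF-8 representation is four bytes: d8 aa d9 81
bytes <- charToRaw(enc2utf8(term))

# Misread those bytes as Windows-1256 and re-encode the result as UTF-8
garbled <- iconv(rawToChar(bytes), from = "CP1256", to = "UTF-8")

# Reverse the mistake: go back to the CP1256 bytes and mark them as UTF-8
recovered <- iconv(garbled, from = "UTF-8", to = "CP1256")
Encoding(recovered) <- "UTF-8"

print(bytes)
print(garbled)
print(recovered == term)
```

If this is what happens here, the other rows would be the same kind of misread bytes, and a row such as `طھط` (three bytes) would be a term that was cut in the middle of a character.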
Fatima
  • how about Sys.setlocale("LC_ALL","Arabic") ? – Areza Jan 17 '19 at 09:26
  • Yes, it works with the original data, but when I use the code to calculate the frequencies the words are converted to symbols. Maybe that is because of TermDocumentMatrix or Corpus. – Fatima Jan 17 '19 at 09:33
  • Are you on windows? R on windows does not always play nicely with Unicode e.g. https://stackoverflow.com/questions/18789330/r-on-windows-character-encoding-hell – anotherfred Jan 17 '19 at 10:44
  • Yes, Windows. But why can R read the text the first time and then not read it afterwards? – Fatima Jan 17 '19 at 12:56

0 Answers