I am a new user and I would like some help with my work in R. I am doing Arabic text mining and would love to hear from anyone with experience in this field. So far I have failed to normalize the Arabic text, and R doesn't even print the Arabic characters in the console. I am stuck now, and I don't know whether it would be right to switch tools, for example doing the mining in Weka or some other way. Can anyone advise me? Has anyone achieved anything in mining Arabic text using R?
By the way, I am working on the analysis of an Arabic tweets data set. It took me one month to fetch the data, and I don't know how long it will take me to pre-process the text.

cecilia
-
StackOverflow is for specific programming questions, not general networking. Your question is just too broad at this point. Please try to edit it to focus on a single programming task. Include a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) showing the problem you are having. If you are most concerned about the values showing up in R, state what OS you are using and what R version and GUI you are running. – MrFlick Sep 04 '14 at 03:15
1 Answer
I don't have much experience in this area, but I do not have problems with Arabic characters when I try this:
```{r}
require(tm)
require(tm.plugin.webmining)
require(SnowballC)
corpus <- WebCorpus(GoogleNewsSource("سلام"))
corpus
inspect(corpus)
tdm <- TermDocumentMatrix(corpus)
```
Make sure to install the proper fonts on your OS and IDE.
```{r}
library(tm)
y <- dget("file") # read the file extracted from MongoDB with the rmongodb package
a <- y$tweet_text # keep only the text of the tweets in the dataset
text_df <- data.frame(a, stringsAsFactors = FALSE) # save as a data frame
# Newer tm releases expect DataframeSource() input to have doc_id and text columns
myCorpus_df <- Corpus(DataframeSource(text_df)) # build a corpus from the data frame
```
On OS X, Arabic characters are properly represented:
```{r}
str(myCorpus_df[1:2])
```
```
List of 2
$ 1:List of 2
..$ content: chr "The CHRONICLE EYE Ahrar al#Sham is clearly fighting #ISIS where its men storm some #Manbij buildings #Aleppo "
..$ meta :List of 7
.. ..$ author : chr(0)
.. ..$ datetimestamp: POSIXlt[1:1], format: "2014-07-03 22:42:18"
.. ..$ description : chr(0)
.. ..$ heading : chr(0)
.. ..$ id : chr "1"
.. ..$ language : chr "en"
.. ..$ origin : chr(0)
.. ..- attr(*, "class")= chr "TextDocumentMeta"
..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
$ 2:List of 2
..$ content: chr "RT @######## جبهة النصرة مهاجرينها وأنصارها مقراتها مكان آمن لكل من يخشى على نفسه الآذى "
..$ meta :List of 7
.. ..$ author : chr(0)
.. ..$ datetimestamp: POSIXlt[1:1], format: "2014-07-03 22:42:18"
.. ..$ description : chr(0)
.. ..$ heading : chr(0)
.. ..$ id : chr "2"
.. ..$ language : chr "en"
.. ..$ origin : chr(0)
.. ..- attr(*, "class")= chr "TextDocumentMeta"
..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
- attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
```
When I check the encoding of an Arabic word on both operating systems (OS X and Windows 7), it appears to be correctly encoded:
```{r}
Encoding("لمياه_و_الإصحا")
```
```
[1] "UTF-8"
```
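If the console output still looks wrong, it may help to inspect the string itself rather than how it is displayed. The following is only a rough sketch using base R functions (`enc2utf8()`, `validUTF8()`, `nchar()`, `utf8ToInt()`); the variable names are made up for illustration, and `validUTF8()` assumes R 3.3.0 or later.

```{r}
# Build the word from Unicode escapes so the result does not depend on
# how the script file itself was saved.
word <- "\u0633\u0644\u0627\u0645"   # the word "سلام"

word_utf8 <- enc2utf8(word)          # convert to UTF-8 if it is not already
is_valid  <- validUTF8(word_utf8)    # TRUE when the bytes are valid UTF-8
n_chars   <- nchar(word_utf8, type = "chars")  # counts characters, not bytes
codes     <- utf8ToInt(word_utf8)    # integer Unicode code points

print(Encoding(word_utf8))           # declared encoding of the string
print(is_valid)
print(n_chars)
print(sprintf("U+%04X", codes))      # code points in the usual U+XXXX form
```

If the code points fall in the Arabic block (U+0600 to U+06FF) and the string is valid UTF-8, the data is probably fine and the problem is more likely in the console font or locale.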
This may also be helpful: Reading arabic data text in R and plot()
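On the normalization step the question asks about, I can only offer a rough base R sketch, not a tested method. The function name `normalize_arabic` is hypothetical, and the choice of rules (strip diacritics, strip tatweel, unify alef variants, map alef maqsura to yeh) is an assumption based on commonly used conventions; adjust it to whatever your analysis needs.

```{r}
# Hypothetical helper: a minimal set of Arabic normalization rules.
# The rule set is an assumption, not a standard.
normalize_arabic <- function(x) {
  x <- enc2utf8(x)
  # Remove short-vowel and related diacritics (U+064B to U+0652)
  x <- gsub("[\u064B-\u0652]", "", x, perl = TRUE)
  # Remove tatweel, the elongation character (U+0640)
  x <- gsub("\u0640", "", x, perl = TRUE)
  # Map alef with madda or hamza (U+0622, U+0623, U+0625) to bare alef (U+0627)
  x <- gsub("[\u0622\u0623\u0625]", "\u0627", x, perl = TRUE)
  # Map alef maqsura (U+0649) to yeh (U+064A)
  x <- gsub("\u0649", "\u064A", x, perl = TRUE)
  x
}

# Example input: alef-with-hamza, fatha, hah, meem, dal
raw     <- "\u0623\u064E\u062D\u0645\u062F"
cleaned <- normalize_arabic(raw)
print(cleaned)
```

Since the function takes and returns a character vector, it could be applied to the tweet text column before the corpus is built.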

Hack-R
-
Thanks a lot for your help. I was indeed going to use my MacBook (Mac OS) today and see the result. I have used the tm and SnowballC packages, but not tm.plugin.webmining; I hope that will help. There is still a lot to do in normalizing Arabic text. Have you tried it, and did you succeed using R? This is for my dissertation and my time is limited, so I just need to know whether anyone has done this kind of mining in R. I will give it another week, and if I don't succeed I will probably switch to another language that is a safer choice for finishing my work by the deadline. Your reply is much appreciated. – cecilia Sep 04 '14 at 11:59
-
Glad that I was able to help a little :) Unfortunately, no, I haven't had any experience with normalizing Arabic text. I think this is an extremely interesting question though and I encourage you to try to recruit help from different fields since this is for your dissertation. For instance, maybe you should go on to Freenode IRC in language, machine learning, and Arabic chat rooms and tell people about the project you're working on. Perhaps send them a link to this question. Also try asking for additional help from Arabic language forums like www.proz.com/forum/arabic-45.html – Hack-R Sep 04 '14 at 13:06
-
`corpus <- WebCorpus(GoogleNewsSource("سلام"))` is throwing errors. – Manoj Kumar Nov 24 '16 at 16:48