I'm doing sentiment analysis for Arabic and want to create my own corpus. To do that, I collected 300 statuses from Facebook and classified them as positive or negative. Now I want to tokenize these statuses to obtain a list of words, then generate unigrams, bigrams, and trigrams and use k-fold cross-validation. For the moment I'm using NLTK in Python. Is it able to do this task for Arabic, or would RapidMiner be better to work with? I'm also wondering how to generate the bigrams and trigrams and how to set up the cross-validation. Any ideas?
-
If you use the right tokenizer, NLTK can handle Arabic. See: http://stackoverflow.com/questions/13035595/tokenization-of-arabic-words-using-nltk. – verbsintransit Mar 07 '13 at 21:47
-
I have had better luck with MALLET. I agree with the comment above. The right tokenizer can handle Arabic. Once you have the text tokenized then the rest of the pipeline is unchanged. – Shane Mar 15 '13 at 22:57
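To make the pipeline from the comments concrete (tokenize, build n-grams, split into folds), here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than anything from the thread: the regex tokenizer only matches runs of Arabic letters and diacritics (U+0621 to U+0652) or ASCII alphanumerics and does no normalization or stemming, and the helper names `tokenize`, `ngrams`, and `k_fold_indices` are hypothetical. NLTK ships an equivalent `nltk.util.ngrams`, and scikit-learn has a `KFold` class, so in practice you would probably use those instead.

```python
import re

# Assumption: a token is a run of Arabic letters/diacritics (U+0621-U+0652)
# or a run of ASCII letters/digits. This is a simplification, not a full
# Arabic tokenizer (no normalization, no clitic splitting).
TOKEN_RE = re.compile(r"[\u0621-\u0652]+|[A-Za-z0-9]+")


def tokenize(text):
    """Return the list of word tokens found in text."""
    return TOKEN_RE.findall(text)


def ngrams(tokens, n):
    """Return all contiguous n-grams of tokens as a list of tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds.

    The data is not shuffled here; shuffle the labelled statuses once
    beforehand if they are stored grouped by class.
    """
    start = 0
    for fold in range(k):
        # Spread the remainder over the first folds so sizes differ by at most 1.
        size = n_items // k + (1 if fold < n_items % k else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


if __name__ == "__main__":
    # Made-up example status, not taken from the asker's corpus.
    status = "الخدمة ممتازة جدا!"
    tokens = tokenize(status)
    print(tokens)             # unigrams
    print(ngrams(tokens, 2))  # bigrams
    print(ngrams(tokens, 3))  # trigrams

    # 300 statuses, 10 folds: each fold holds out 30 statuses for testing.
    for train_idx, test_idx in k_fold_indices(300, 10):
        print(len(train_idx), len(test_idx))
```

Once the statuses are tokenized, the n-gram and cross-validation steps do not depend on the language, which is the point made in the comment above.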
1 Answer
Well, I think RapidMiner is a good fit and can handle this task. It includes several operators for text mining, and it also lets you create new operators fairly easily.

One Day