Creating an Arabic corpus

Question

I'm doing sentiment analysis for Arabic and I want to create my own corpus. To do that, I collected 300 statuses from Facebook and classified them as positive or negative. Now I want to tokenize these statuses to obtain a list of words, then generate unigrams, bigrams, and trigrams and use cross-validation. For the moment I'm using NLTK in Python. Is it able to do this task for Arabic, or would RapidMiner be better to work with? I'm also wondering how to generate the bigrams and trigrams and how to run the cross-validation. Any ideas?

If you use the right tokenizer, NLTK can handle Arabic. See: http://stackoverflow.com/questions/13035595/tokenization-of-arabic-words-using-nltk. — verbsintransit, Mar 07 '13 at 21:47
I have had better luck with MALLET. I agree with the comment above. The right tokenizer can handle Arabic. Once you have the text tokenized then the rest of the pipeline is unchanged. — Shane, Mar 15 '13 at 22:57
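
To make the comments above concrete, here is a minimal sketch of the pipeline the question describes: tokenize, build n-grams, then split the data into cross-validation folds. It uses only the Python standard library so it runs anywhere; the names `tokenize`, `ngrams`, and `k_fold_indices` and the sample sentence are illustrative assumptions, not anything from the question. NLTK offers comparable building blocks (`nltk.tokenize.wordpunct_tokenize` and `nltk.util.ngrams`), which could be swapped in. The regex tokenizer is a simplification: it is not an Arabic-aware tokenizer, and combining diacritics may split words, so vocalized text would likely need normalizing first.

```python
import re

# \w matches Unicode word characters in Python 3, which includes Arabic letters.
# Assumption: the text carries no diacritics (tashkeel); those may split tokens.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Split text into word tokens, dropping punctuation and whitespace."""
    return TOKEN_RE.findall(text)


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for each of k folds.

    Items are assigned to folds round-robin; shuffle the data beforehand
    if its order is not random (for example, all positives first).
    """
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


# Hypothetical sample status ("I love this book").
sample = "أنا أحب هذا الكتاب"
tokens = tokenize(sample)
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

# With 300 labelled statuses, 10 folds gives 270 training and 30 test items each.
splits = list(k_fold_indices(300, 10))
```

Each `(train, test)` pair in `splits` holds list indices into the labelled statuses; a classifier would be trained on the n-gram features of the training items and scored on the test items, with the scores averaged over the folds.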

score 0 · Answer 1 · answered Mar 10 '13 at 07:00

I think RapidMiner is a good option and can handle this task. It includes several text-mining operators, and it also makes it fairly easy to create new operators.

One Day
