
I'm using NLTK's word_tokenize to split a sentence into words.

I want to tokenize this sentence:

في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء 

The code I'm writing is:

import re
import nltk

lex = u" في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء"

wordsArray = nltk.word_tokenize(lex)
print " ".join(wordsArray)

The problem is that the word_tokenize function doesn't split by words. Instead, it splits by letters so that the output is:

"ف ي _ ب ي ت ن ا ك ل ش ي ل م ا ت ح ت ا ج ه ي ض ي ع ... ا د و ر ع ل ى ش ا ح ن ف ج أ ة ي خ ت ف ي .. ل د ر ج ة ا ن ي ا س و ي ن ف س ي ا د و ر ش ي ء"

Any ideas?

What I've reached so far:

Trying the text here, it appeared to be tokenized into letters, yet other tokenizers tokenized it correctly. Does that mean that word_tokenize is meant for English only? And does that go for most NLTK functions?

Hady Elsahar

  • Does http://stackoverflow.com/questions/7386856/python-arabic-nlp help? (And a stemmer http://nltk.org/api/nltk.stem.html#module-nltk.stem.isri) – Jon Clements Oct 23 '12 at 17:12
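For the stemmer that comment points to, NLTK ships an ISRI Arabic stemmer in nltk.stem.isri. A minimal sketch, assuming you pair it with one of NLTK's regex-based tokenizers (the sentence is shortened here, and the exact stems can differ between NLTK versions):

    # -*- coding: utf-8 -*-
    from nltk.tokenize import wordpunct_tokenize
    from nltk.stem.isri import ISRIStemmer

    stemmer = ISRIStemmer()
    text = u"في_بيتنا كل شي لما تحتاجه يضيع"

    # Tokenize with a regex-based tokenizer, then reduce each token to its ISRI stem.
    for token in wordpunct_tokenize(text):
        print(token, stemmer.stem(token))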

4 Answers

13

I always recommend using nltk.tokenize.wordpunct_tokenize. You can try out many of the NLTK tokenizers at http://text-processing.com/demo/tokenize/ and see for yourself.
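A minimal sketch on the sentence from the question (the exact token list can vary between NLTK versions, but the split is word-by-word rather than letter-by-letter):

    # -*- coding: utf-8 -*-
    from nltk.tokenize import wordpunct_tokenize

    lex = u"في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء"

    # wordpunct_tokenize is purely regex-based (roughly \w+|[^\w\s]+), so it splits on
    # whitespace and punctuation boundaries and works the same for any Unicode text.
    print(wordpunct_tokenize(lex))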

Jacob
  • What is the difference between most of those tokenizers? And does that mean that most NLTK functions won't work with Arabic? – Hady Elsahar Oct 24 '12 at 23:23
  • The TreebankWordTokenizer is trained on Wall Street Journal text, which is ASCII, so it never works well on Unicode text. The PunktWordTokenizer is trained on a wider variety of text, but I find it less predictable than the rest of them, which use regular expressions, making them usable on any language with predictable results. – Jacob Oct 25 '12 at 02:02
  • NLTK in general works just fine with Arabic, and any Unicode text; it's just that some models expect ASCII and therefore don't do well with Unicode. – Jacob Oct 25 '12 at 02:03
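To see the difference described in these comments, a small comparison sketch (how the Treebank tokenizer handles non-ASCII input depends on your NLTK version, while the regex-based tokenizer behaves the same everywhere):

    # -*- coding: utf-8 -*-
    from nltk.tokenize import TreebankWordTokenizer, WordPunctTokenizer

    lex = u"في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي"

    # Rule-based tokenizer developed on (ASCII) Wall Street Journal text.
    print(TreebankWordTokenizer().tokenize(lex))

    # Regex-based tokenizer: splits any Unicode text on word/punctuation boundaries.
    print(WordPunctTokenizer().tokenize(lex))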
2

This is the output I get with my code. As I recall, Unicode doesn't play well with Python 2; I used Python 3.5.

nltk.word_tokenize('في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء ')

['في_بيتنا', 'كل', 'شي', 'لما', 'تحتاجه', 'يضيع', '...', 'ادور', 'على', 'شاحن', 'فجأة', 'يختفي', '..لدرجة', 'اني', 'اسوي', 'نفسي', 'ادور', 'شيء']
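If you are stuck on Python 2, the usual pitfall is handing the tokenizer a UTF-8 encoded byte string instead of a unicode object. A hedged sketch of the usual fix (whether word_tokenize then splits Arabic well still depends on your NLTK version):

    # -*- coding: utf-8 -*-
    import nltk

    raw = 'في_بيتنا كل شي لما تحتاجه يضيع'  # in Python 2 this literal is a byte string
    text = raw.decode('utf-8')  # decode to a unicode object before tokenizing

    print(nltk.word_tokenize(text))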

Pradi KL
0
    import nltk
    # download the Punkt models that word_tokenize relies on
    nltk.download('punkt')
    st = 'في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء.... '
    print(nltk.word_tokenize(st))

['في_بيتنا', 'كل', 'شي', 'لما', 'تحتاجه', 'يضيع', '...', 'ادور', 'على', 'شاحن', 'فجأة', 'يختفي', '..', 'لدرجة', 'اني', 'اسوي', 'نفسي', 'ادور', 'شيء', '....']

-1
import nltk

# download the Punkt models that word_tokenize relies on
nltk.download('punkt')

text = 'أسلوب المقاولات أغلى وأكثر خسارة لرب العمل من تشغيل العمال بالأجور اليومية العمل لكنه أكثر راحة له وأبعد عن القلق.'

print(nltk.word_tokenize(text))

['أسلوب', 'المقاولات', 'أغلى', 'وأكثر', 'خسارة', 'لرب', 'من', 'تشغيل', 'العمال', 'بالأجور', 'اليومية', 'العمل', 'لكنه', 'أكثر', 'راحة', 'له', 'وأبعد', 'عن', 'القلق', '.']
  • Why does this code block work? How is it any different from the other answers? Please edit your answer to include that explanation. – rene Sep 02 '22 at 08:19