I'm using regexp_tokenize() to return tokens from Arabic text with the punctuation marks stripped out:
from nltk.tokenize import regexp_tokenize

def PreProcess_text(Input):
    # Split on the Arabic comma, question mark and semicolon,
    # plus the Latin '!' and '.'
    tokens = regexp_tokenize(Input, r'[،؟!.؛]\s*', gaps=True)
    return tokens

H = raw_input('H:')
Cleand = PreProcess_text(H)
print '\n'.join(Cleand)
The tokenizing itself seems to work, but the problem appears when I try to print the result.
The output for the text ايمان،سعد is:
?يم
?ن
?
?
?
But if the text is in English, even with Arabic punctuation marks, it prints the right result.
The output for the text hi،eman is:
hi
eman
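My guess is that raw_input returns an undecoded byte string in Python 2, so the pattern ends up matching individual bytes inside the multi-byte UTF-8 sequences of the Arabic letters. Below is a sketch of the fix I have in mind, assuming a UTF-8 terminal; I'm not sure the diagnosis is right:

# -*- coding: utf-8 -*-
import sys
from nltk.tokenize import regexp_tokenize

def preprocess_text(text):
    # A Unicode pattern, so the character class holds whole characters
    # rather than the separate bytes of their UTF-8 encodings.
    return regexp_tokenize(text, ur'[،؟!.؛]\s*', gaps=True)

raw = raw_input('H:')
# Decode the input byte string to Unicode before tokenizing; fall back
# to UTF-8 when the terminal encoding cannot be detected.
text = raw.decode(sys.stdin.encoding or 'utf-8')
print u'\n'.join(preprocess_text(text))

With the decode in place I would expect ايمان and سعد on separate lines. Is decoding the input like this the right approach, or is the problem somewhere else?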