
I'm using regexp_tokenize() to return tokens from an Arabic text without any punctuation marks:

import re, string, sys
from nltk.tokenize import regexp_tokenize

def PreProcess_text(Input):
    tokens = regexp_tokenize(Input, r'[،؟!.؛]\s*', gaps=True)
    return tokens

H = raw_input('H:')
Cleand = PreProcess_text(H)
print '\n'.join(Cleand)

It works fine, but the problem appears when I try to print the text.

The output for the text ايمان،سعد:

    ?يم
    ?ن
    ?
    ?
    ? 

But if the text is in English, even with Arabic punctuation marks, it prints the right result.

The output for the text hi،eman:

     hi
     eman
Eman
  • What's the expected output for your Arabic text? – NullUserException Aug 26 '16 at 16:43
  • It's probably the fact that Arabic is printed backwards. In Perl, I get output of ايمان and ،سعد –  Aug 26 '16 at 16:58
  • You use Python 2.x, don't you? In Python 3.4, I get `ايمان` and `سعد` when I enter `ايمان،سعد` – Wiktor Stribiżew Aug 26 '16 at 18:35
  • The expected output should be: ايمان سعد (the same as you)... yes, I think the problem is that I'm using Python 2.7 – Eman Aug 26 '16 at 19:00
  • Please use `@`+username to notify a user of your feedback. I suggest using `u` prefix: `ur'[\u060C\u061F!.\u061B]\s*'` and do not pass just `H` - try either `unicode(H, "utf-8")` or `H.decode('utf8')` – Wiktor Stribiżew Aug 27 '16 at 08:17
  • Any news? Did you manage to get it to work? BTW, are you on Windows or Linux? – Wiktor Stribiżew Aug 28 '16 at 16:03
  • @WiktorStribiżew first I tried using either `unicode(H, "utf-8")` or `H.decode('utf8')`, but there is an error in the printing. I think the solution is to change to Python 3; if you know how on Mac that would be very helpful. THANK YOU – Eman Aug 30 '16 at 17:21
  • I think you can check https://docs.python.org/3/using/mac.html. Also, I am sure you have not checked everything in Python 2.x. Check http://stackoverflow.com/questions/477061/how-to-read-unicode-input-and-compare-unicode-strings-in-python – Wiktor Stribiżew Aug 30 '16 at 18:08
  • @WiktorStribiżew thanks so much for your help, somehow the `H.decode('utf8')` worked perfectly!!! Thank you again – Eman Aug 30 '16 at 18:29

1 Answer


When you use raw_input in Python 2, the input comes in as a byte string, not as Unicode.

You need to convert it into a Unicode string with

H.decode('utf8')

And you may keep your regex:

tokens = regexp_tokenize(Input, r'[،؟!.؛]\s*', gaps=True)
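
A minimal sketch of the full fix under Python 2.7, assuming a UTF-8 terminal (the trailing `.encode('utf8')` is only needed if printing the joined Unicode string raises a `UnicodeEncodeError`):

    # -*- coding: utf-8 -*-
    from nltk.tokenize import regexp_tokenize

    def PreProcess_text(Input):
        # split on the Arabic comma/question mark/semicolon plus ! and .
        return regexp_tokenize(Input, ur'[،؟!.؛]\s*', gaps=True)

    H = raw_input('H:')                          # raw_input returns a byte string in Python 2
    Cleand = PreProcess_text(H.decode('utf8'))   # decode the bytes to a Unicode string first
    print u'\n'.join(Cleand).encode('utf8')      # encode back to UTF-8 for the terminal

On Python 3 none of this is needed: `input()` already returns a Unicode `str`, as noted in the comments.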
Wiktor Stribiżew