
I'm using regexp_tokenize() to return tokens from an Arabic text without any punctuation marks:

import re, string, sys
from nltk.tokenize import regexp_tokenize

def PreProcess_text(Input):
    tokens = regexp_tokenize(Input, r'[،؟!.؛]\s*', gaps=True)
    return tokens

H = raw_input('H:')
Cleand = PreProcess_text(H)
print '\n'.join(Cleand)

It works fine, but the problem appears when I try to print the text.

The output for the text ايمان،سعد:

    ?يم
    ?ن
    ?
    ?
    ? 

But if the text is in English, even with Arabic punctuation marks, it prints the right result.

The output for the text hi،eman:

     hi
     eman
Eman
  • What's the expected output for your Arabic text? – NullUserException Aug 26 '16 at 16:43
  • It's probably the fact that Arabic is printed backwards. In Perl, I get output of ايمان and ،سعد –  Aug 26 '16 at 16:58
  • You use Python 2.x, don't you? In Python 3.4, I get `ايمان` and `سعد` when I enter `ايمان،سعد` – Wiktor Stribiżew Aug 26 '16 at 18:35
  • The expected output should be: ايمان سعد (the same as you)... yes, I think the problem is that I'm using Python 2.7 – Eman Aug 26 '16 at 19:00
  • Please use `@`+username to notify a user of your feedback. I suggest using `u` prefix: `ur'[\u060C\u061F!.\u061B]\s*'` and do not pass just `H` - try either `unicode(H, "utf-8")` or `H.decode('utf8')` – Wiktor Stribiżew Aug 27 '16 at 08:17
  • Any news? Did you manage to get it to work? BTW, are you on Windows or Linux? – Wiktor Stribiżew Aug 28 '16 at 16:03
  • @WiktorStribiżew first I tried using either `unicode(H, "utf-8")` or `H.decode('utf8')`, but there is an error in the printing. I think the solution is to change to Python 3; if you know how on Mac that would be very helpful. THANK YOU – Eman Aug 30 '16 at 17:21
  • I think you can check https://docs.python.org/3/using/mac.html. Also, I am sure you have not checked everything in Python 2.x. Check http://stackoverflow.com/questions/477061/how-to-read-unicode-input-and-compare-unicode-strings-in-python – Wiktor Stribiżew Aug 30 '16 at 18:08
  • @WiktorStribiżew thanks so much for your help, somehow the `H.decode('utf8')` worked perfectly!!! Thank you again – Eman Aug 30 '16 at 18:29

1 Answer


When you use raw_input in Python 2, the input comes in as a byte string, not as Unicode.

You need to convert it into a Unicode string with

H.decode('utf8')

And you may keep your regex:

tokens = regexp_tokenize(Input, r'[،؟!.؛]\s*', gaps=True)
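
A minimal sketch of the full fix under Python 2.7, assuming a UTF-8 terminal (the trailing `.encode('utf8')` is only needed if printing the joined Unicode string raises a `UnicodeEncodeError`):

    # -*- coding: utf-8 -*-
    from nltk.tokenize import regexp_tokenize

    def PreProcess_text(Input):
        # split on the Arabic comma/question mark/semicolon plus ! and .
        return regexp_tokenize(Input, ur'[،؟!.؛]\s*', gaps=True)

    H = raw_input('H:')                          # raw_input returns a byte string in Python 2
    Cleand = PreProcess_text(H.decode('utf8'))   # decode the bytes to a Unicode string first
    print u'\n'.join(Cleand).encode('utf8')      # encode back to UTF-8 for the terminal

On Python 3 none of this is needed: `input()` already returns a Unicode `str`, as noted in the comments.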
Wiktor Stribiżew