0

I'm trying to preprocess a Persian text file, but some of its digits are written as Arabic digits instead of Persian ones. I want to fix this using regex. Here is my code snippet:

def preprocessing(content):
    import re
    for d in range(10):
        arabic_digit = rf"\u066{d}"
        persian_digit = rf"\u06F{d}"
        content = re.sub(arabic_digit, persian_digit, content)
    return(content)

but it gives this error message:

error: bad escape \u at position 0

I wonder how I should use variables inside regex patterns. The weird thing is that the problem is with the second pattern (persian_digit): when I change it to a static string, there are no errors. Thanks for your time.

Mehdi Abbassi
  • Refer to [this](https://en.m.wikipedia.org/wiki/Arabic_script_in_Unicode) –  Jul 05 '21 at 09:17
  • Have you tried without `r` prefix? It seems it matters: https://stackoverflow.com/a/54815485/16354567 – Zebartin Jul 05 '21 at 09:20
  • Thanks, @Sujay. My problem is a little technical. I want to replace r'\u0660' with r'\u06F0', then r'\u0661' with r'\u06F1', and so on. I don't understand why Python (or maybe the re library) treats the two patterns by different criteria. – Mehdi Abbassi Jul 05 '21 at 09:23
  • Thanks, @Zebartin. When I remove `r`, it gives this error: `arabic_digit = f"\u066{d}" SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-4: truncated \uXXXX escape` – Mehdi Abbassi Jul 05 '21 at 09:35

2 Answers

2

chr() is the way to generate a character from its Unicode code point:

def preprocessing(content):
    import re
    for d in range(10):
        arabic_digit = chr(0x660 + d)
        persian_digit = chr(0x6f0 + d)
        content = re.sub(arabic_digit, persian_digit, content)
    return content

However, str has a built-in .translate method for making mass substitutions that is much more efficient. Pass str.maketrans a string of characters to replace and a same-length string of new characters:

arabic_digits = ''.join([chr(i) for i in range(0x660,0x66a)])
persian_digits = ''.join([chr(i) for i in range(0x6f0,0x6fa)])
print('Arabic: ',arabic_digits)
print('Persian:',persian_digits)

# compute the translation table once
_xlat = str.maketrans(arabic_digits,persian_digits)

def preprocessing(content):
    return content.translate(_xlat)

test = '4\u06645\u06656\u0666'

print('before:',test)
print('after: ',preprocessing(test))

Output:

Arabic:  ٠١٢٣٤٥٦٧٨٩
Persian: ۰۱۲۳۴۵۶۷۸۹
before: 4٤5٥6٦
after:  4۴5۵6۶
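
If a regex-based version is still wanted, the following is a minimal sketch that matches the whole Arabic digit range in one pass and passes a function as the replacement. The names `ARABIC_DIGIT`, `OFFSET` and `preprocessing_regex` are made up for illustration. It relies on the two digit blocks being contiguous (U+0660..U+0669 and U+06F0..U+06F9), and on the fact that the return value of a callable replacement is inserted literally, without any escape processing:

```python
import re

# Arabic digits are U+0660..U+0669; Persian digits are U+06F0..U+06F9.
# Both names below are hypothetical helpers, not part of the original code.
ARABIC_DIGIT = re.compile("[\u0660-\u0669]")
OFFSET = 0x06F0 - 0x0660  # distance between the two digit blocks


def preprocessing_regex(content):
    # A callable replacement receives the match object; whatever it returns
    # is used as-is, so no backslash escapes are interpreted.
    return ARABIC_DIGIT.sub(lambda m: chr(ord(m.group()) + OFFSET), content)


print(preprocessing_regex("4\u06645\u06656\u0666"))
```

For a plain character-for-character mapping like this one, .translate remains the simpler choice; the callable form is mainly useful when the replacement has to be computed from the match.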
Mark Tolonen
1

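According to this, unknown escapes consisting of '\' and an ASCII letter are not allowed in the replacement string (repl) of re.sub(), which is the error you came across. The first argument works because the re module itself understands \uXXXX inside a pattern, but not inside a replacement.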
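
A small sketch of that asymmetry, assuming Python 3.7 or later (the variable names `text`, `ok`, `failed` and `message` are only for illustration): the same raw `\uXXXX` text is accepted as the pattern but rejected as the replacement.

```python
import re

text = "\u0664"  # ARABIC-INDIC DIGIT FOUR

# Pattern side: re parses \uXXXX itself, so a raw string is fine here.
# The replacement is an ordinary string that already holds the real character.
ok = re.sub(r"\u0664", "\u06F4", text)

# Replacement side: the raw string reaches re as a backslash followed by "u",
# which is not a valid escape in a replacement template.
try:
    re.sub(r"\u0664", r"\u06F4", text)
    failed = False
    message = ""
except re.error as exc:
    failed = True
    message = str(exc)

print(ok == "\u06F4", failed, message)
```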

What you can do is decode the raw string back into a "normal" string like this, though I am not sure whether it is best practice:

import codecs
import re

def preprocessing(content):
    for d in range(10):
        arabic_digit = codecs.decode(rf"\u066{d}", 'unicode_escape')
        persian_digit = codecs.decode(rf"\u06F{d}", 'unicode_escape')
        content = re.sub(arabic_digit, persian_digit, content)
    return content
Zebartin