using variables inside regex patterns in Python

Question

I'm trying to preprocess a Persian text file, but some of its digits are written as Arabic digits instead of Persian ones. I want to fix this using regex. Here is my snippet of code:

def preprocessing(content):
    import re
    for d in range(10):
        arabic_digit = rf"\u066{d}"
        persian_digit = rf"\u06F{d}"
        content = re.sub(arabic_digit, persian_digit, content)
    return(content)

but it gives this error message:

error: bad escape \u at position 0

I wonder how I should use variables inside regex patterns. The weird thing is that the problem is with the second pattern (persian_digit); when I change it to a static string, there are no errors. Thanks for your time.

Refer to [this](https://en.m.wikipedia.org/wiki/Arabic_script_in_Unicode) — , Jul 05 '21 at 09:17
Have you tried without `r` prefix? It seems it matters: https://stackoverflow.com/a/54815485/16354567 — Zebartin, Jul 05 '21 at 09:20
Thanks, @Sujay. My problem is a little technical. I want to replace r'\u0660' with r'\u06F0', then r'\u0661' with r'\u06F1', and so on. I don't understand why Python (or maybe the re library) treats the two patterns by different criteria. — Mehdi Abbassi, Jul 05 '21 at 09:23
Thanks, @Zebartin. When I remove `r`, it gives this error: `arabic_digit = f"\u066{d}" SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-4: truncated \uXXXX escape` — Mehdi Abbassi, Jul 05 '21 at 09:35

Mark Tolonen · Accepted Answer · 2021-07-06T05:17:57.990

chr() is the way to generate a character from a Unicode code point:

def preprocessing(content):
    import re
    for d in range(10):
        arabic_digit = chr(0x660 + d)
        persian_digit = chr(0x6f0 + d)
        content = re.sub(arabic_digit, persian_digit, content)
    return content

But str has a built-in .translate method for making mass substitutions that is much more efficient. Pass str.maketrans a string of characters to replace and a same-length string of new characters:

arabic_digits = ''.join([chr(i) for i in range(0x660,0x66a)])
persian_digits = ''.join([chr(i) for i in range(0x6f0,0x6fa)])
print('Arabic: ',arabic_digits)
print('Persian:',persian_digits)

# compute the translation table once
_xlat = str.maketrans(arabic_digits,persian_digits)

def preprocessing(content):
    return content.translate(_xlat)

test = '4\u06645\u06656\u0666'

print('before:',test)
print('after: ',preprocessing(test))

Output:

Arabic:  ٠١٢٣٤٥٦٧٨٩
Persian: ۰۱۲۳۴۵۶۷۸۹
before: 4٤5٥6٦
after:  4۴5۵6۶
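
If a regex is still preferred, the ten re.sub calls from the first snippet can be collapsed into a single pass. The following is only a sketch, not part of the answer above: it assumes the digits to fix lie in U+0660 to U+0669 and relies on the Persian digits sitting exactly 0x90 code points higher. The names ARABIC_DIGIT_RE, OFFSET, and to_persian_digits are made up for illustration.

```python
import re

# Character class covering ARABIC-INDIC DIGIT ZERO..NINE (U+0660..U+0669).
ARABIC_DIGIT_RE = re.compile("[\u0660-\u0669]")

# Distance from U+0660 to U+06F0 (EXTENDED ARABIC-INDIC DIGIT ZERO).
OFFSET = 0x6F0 - 0x660

def to_persian_digits(content):
    # A callable replacement receives the match object and returns the
    # replacement text directly, so no escape processing is applied to it.
    return ARABIC_DIGIT_RE.sub(lambda m: chr(ord(m.group()) + OFFSET), content)

print(to_persian_digits("4\u06645\u06656\u0666"))
```

Because the replacement is a function rather than a template string, the "bad escape" problem from the question cannot occur here. For plain one-to-one character mapping, str.translate is likely still the simpler choice.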

score 1 · Answer 2 · answered Jul 05 '21 at 09:49

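According to the re documentation, unknown escapes consisting of '\' and an ASCII letter are not allowed in re.sub(). Here the offending argument is the replacement string: the pattern syntax understands \uXXXX, but the replacement template does not, so \u is rejected there, which is the error you came across.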
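
A minimal sketch of which argument triggers the error, assuming Python 3.7 or later (where unknown escapes in the replacement are errors rather than warnings). The names pattern_text, repl_text, error_message, and fixed are illustrative only.

```python
import re

# Raw f-strings: each holds six literal characters (backslash, u, four hex digits).
pattern_text = rf"\u066{4}"   # the text \u0664
repl_text = rf"\u06F{4}"      # the text \u06F4

# As a pattern, \uXXXX is a supported escape, so this matches U+0664.
assert re.search(pattern_text, "\u0664") is not None

# As a replacement template, \u is not a supported escape, so re.sub raises.
try:
    re.sub(pattern_text, repl_text, "\u0664")
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)  # on 3.7+ this should read: bad escape \u at position 0

# Passing the actual character as the replacement avoids the problem.
fixed = re.sub(pattern_text, chr(0x6F4), "\u0664")
print(fixed == "\u06f4")
```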

What you can do is turn the raw string back into a "normal" string like this, though I am not sure it is the best practice:

import codecs
import re

def preprocessing(content):
    for d in range(10):
        arabic_digit = codecs.decode(rf"\u066{d}", 'unicode_escape')
        persian_digit = codecs.decode(rf"\u06F{d}", 'unicode_escape')
        content = re.sub(arabic_digit, persian_digit, content)
    return content
