When using regex in Python, it's easy to use brackets to represent a range of characters such as a-z, but this doesn't seem to work for other languages, like Arabic:

import re
pattern = '[ي-ا]'
p = re.compile(pattern)

This results in a long error report that ends with

raise error("bad character range")
sre_constants.error: bad character range

How can this be fixed?

Morteza R
  • The end range character is at code point 1575 (decimal), while the start range is at code point 1610 (decimal), which explains the error you are having. – nhahtdh Dec 29 '14 at 09:06
  • BTW: See [this](https://stackoverflow.com/a/50018691/8291949) answer on how to properly validate Persian/Farsi. – wp78de Jun 14 '18 at 18:09

3 Answers

Since Arabic characters are rendered from right to left, the correct string below, which reads "from ا to ي", is displayed backward (try selecting the string if you want to confirm):

'[ا-ي]'

Console output:

>>> re.compile('[ا-ي]')
<_sre.SRE_Pattern object at 0x6001f0a80>

>>> re.compile('[ا-ي]', re.DEBUG)
in
  range (1575, 1610)
<_sre.SRE_Pattern object at 0x6001f0440>

So your pattern '[ي-ا]' is actually "from ي to ا", which is an invalid range, since the code point of ا (1575) is smaller than the code point of ي (1610).
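The point above can be checked directly. This is a minimal sketch, assuming Python 3 (where the exception is exposed as `re.error`); the helper name `compiles` is made up for illustration. The characters are written as escapes so the source reads the same regardless of how an editor displays right-to-left text.

```python
import re

ALEF = "\u0627"  # ا
YEH = "\u064a"   # ي

# The order of a range is decided by code points, not by on-screen direction.
alef_cp, yeh_cp = ord(ALEF), ord(YEH)

def compiles(pattern):
    """Return True if `pattern` is a valid regular expression (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

forward_ok = compiles("[" + ALEF + "-" + YEH + "]")   # low code point to high
reversed_ok = compiles("[" + YEH + "-" + ALEF + "]")  # high code point to low

print(alef_cp, yeh_cp, forward_ok, reversed_ok)
```

Only the low-to-high order compiles; the reversed range is rejected with "bad character range".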

To prevent confusion, Ignacio Vazquez-Abrams's suggestion of using Unicode escapes is a good alternative to the solution I provide above.

nhahtdh
  • While the comment about the right-to-left range is correct, the answer is probably misleading for everybody who wants to validate/parse Farsi: Do not use `[ا-ي]`! See the canonical answer by [revo](https://stackoverflow.com/users/1020526/revo) on how to [properly validate Persian/Farsi](https://stackoverflow.com/a/50018691/8291949). – wp78de Jun 14 '18 at 17:19
  • A bit more general: `[ءؤئإآأا-ي]+` – 989 Jul 08 '20 at 08:27

Use Unicode escapes instead.

>>> re.compile('[\u0627-\u064a]')
<_sre.SRE_Pattern object at 0x237f460>
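As a rough usage sketch of the escaped pattern, assuming Python 3 (where `\u0627` inside an ordinary string literal is the character itself; Python 2 would need a `u'...'` literal). The sample text and variable names are made up for illustration.

```python
import re

# Same range as above, written with escapes so text direction cannot reorder it.
arabic_letter = re.compile("[\u0627-\u064a]")

# U+0628 (ب) lies inside the range; a Latin letter does not.
matches_beh = arabic_letter.fullmatch("\u0628") is not None
matches_latin = arabic_letter.fullmatch("a") is not None

# Pull the Arabic letters out of mixed text; the escaped word is سلام.
found = arabic_letter.findall("abc \u0633\u0644\u0627\u0645 123")

print(matches_beh, matches_latin, found)
```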
simonzack
Ignacio Vazquez-Abrams
The accepted answer does work; however, the range [\u0627-\u064a] does not include the variants of 'ا' such as 'أ', 'آ' or 'إ', nor 'ؤ', the hamza variant of 'و' (the plain 'و' itself is in the range). (I wanted to comment on or suggest an edit to the accepted answer, but there's a queue.)

So in case someone is (re)visiting this question and needs those letter variations, a Unicode range that worked better for me was [\u0600-\u06FF], which covers the whole Arabic Unicode block, making the answer:

pattern = re.compile('[\u0600-\u06FF]')
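To make the difference between the two ranges concrete, here is a small comparison sketch, assuming Python 3. The names `narrow`, `wide` and `variants` are illustrative only. Note that the wider range covers the whole Arabic block, so it also matches non-letters such as Arabic-Indic digits, which may or may not be what you want.

```python
import re

narrow = re.compile("[\u0627-\u064a]")  # ا through ي
wide = re.compile("[\u0600-\u06ff]")    # the whole Arabic Unicode block

# Hamza/madda variants that sit just below U+0627.
variants = {
    "alef with madda above": "\u0622",  # آ
    "alef with hamza above": "\u0623",  # أ
    "waw with hamza above": "\u0624",   # ؤ
    "alef with hamza below": "\u0625",  # إ
}

narrow_hits = {name: bool(narrow.fullmatch(ch)) for name, ch in variants.items()}
wide_hits = {name: bool(wide.fullmatch(ch)) for name, ch in variants.items()}

# Plain waw (U+0648) is already inside the narrow range.
narrow_matches_waw = bool(narrow.fullmatch("\u0648"))

# The wide range also accepts Arabic-Indic digits such as U+0661.
wide_matches_digit = bool(wide.fullmatch("\u0661"))
narrow_matches_digit = bool(narrow.fullmatch("\u0661"))

print(narrow_hits)
print(wide_hits)
```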
Nehal