Using range in regex for Arabic letters

Question

When using Regex in Python, it's easy to use brackets to represent a range of characters a-z, but this doesn't seem to be working for other languages, like Arabic:

import re
pattern = '[ي-ا]'
p = re.compile(pattern)

This results in a long error report that ends with

raise error("bad character range")
sre_constants.error: bad character range

How can this be fixed?

The end range character is at code point 1575 (decimal), while the start range is at code point 1610 (decimal), which explains the error you are having. — nhahtdh, Dec 29 '14 at 09:06
BTW: See [this](https://stackoverflow.com/a/50018691/8291949) answer for how to properly validate Persian/Farsi. — wp78de, Jun 14 '18 at 18:09

score 11 · Answer 1 · answered Dec 29 '14 at 09:13


Because Arabic text is rendered from right to left, the correct string below, which reads "from ا to ي", appears backward on screen (try selecting the string if you want to confirm):

'[ا-ي]'

Console output:

>>> re.compile('[ا-ي]')
<_sre.SRE_Pattern object at 0x6001f0a80>

>>> re.compile('[ا-ي]', re.DEBUG)
in
  range (1575, 1610)
<_sre.SRE_Pattern object at 0x6001f0440>

So your pattern '[ي-ا]' is actually "from ي to ا", which is an invalid range because the code point of ا is smaller than the code point of ي.
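
The code point comparison can be checked directly. This is a minimal sketch assuming Python 3; the names ALEF, YEH, valid and reversed_ok are only for illustration, and the letters are written as escapes so the display order cannot mislead:

```python
import re

ALEF = "\u0627"  # ا
YEH = "\u064a"   # ي

# Decimal code points, as quoted in the comment on the question.
print(ord(ALEF), ord(YEH))

# Low-to-high order (ALEF to YEH) compiles.
valid = re.compile("[" + ALEF + "-" + YEH + "]")

# High-to-low order (YEH to ALEF) is rejected with a "bad character range" error.
try:
    re.compile("[" + YEH + "-" + ALEF + "]")
    reversed_ok = True
except re.error as exc:
    reversed_ok = False
    print("rejected:", exc)
```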

To prevent confusion, Ignacio Vazquez-Abrams's suggestion of using Unicode escapes is a good alternative to the solution above.

answered Dec 29 '14 at 09:13

nhahtdh


While the comment about right-to-left range is correct, the answer is probably misleading for everybody who wants to validate/parse Farsi: Do not use `[ا-ي]`! See the canonical answer by [revo](https://stackoverflow.com/users/1020526/revo) on how to [properly validate Persian/Farsi](https://stackoverflow.com/a/50018691/8291949). – wp78de Jun 14 '18 at 17:19
A bit more general: `[ءؤئإآأا-ي]+` – 989 Jul 08 '20 at 08:27

score 8 · Accepted Answer · edited Dec 29 '14 at 09:08


Use Unicode escapes instead.

>>> re.compile('[\u0627-\u064a]')
<_sre.SRE_Pattern object at 0x237f460>
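
A slightly fuller sketch of the same idea, assuming Python 3, where an ordinary string literal turns \uXXXX into the character itself (in Python 2 the literal would need a u prefix). The sample string and variable names are made up for illustration:

```python
import re

# Same range as above: U+0627 (alef) through U+064A (yeh).
arabic_letter = re.compile("[\u0627-\u064a]")

# "abc", then the word seen-lam-alef-meem written as escapes, then "123".
sample = "abc \u0633\u0644\u0627\u0645 123"

# Only the four Arabic letters are picked out; Latin letters and digits are skipped.
found = arabic_letter.findall(sample)
print(found)
```

Writing the escapes keeps the range readable in logical order regardless of how an editor displays right-to-left text.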

edited Dec 29 '14 at 09:08

simonzack


answered Dec 29 '14 at 09:04

Ignacio Vazquez-Abrams


Nehal · Answer 3 · 2021-05-18T17:01:18.870

2

The accepted answer does work; however, the range [\u0627-\u064a] does not include variations of the letter 'ا' such as 'أ', 'آ' or 'إ', nor 'ؤ', the variation of the letter 'و'. (I wanted to comment on or suggest an edit to the accepted answer, but there's a queue.)

So in case someone is (re)visiting this question and needs those letter variations, a Unicode range that worked better for me was [\u0600-\u06FF], making the answer:

pattern = re.compile('[\u0600-\u06FF]')
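
The difference between the two ranges can be seen in a short comparison. This is a sketch assuming Python 3, with illustrative variable names. Note that U+0600 to U+06FF is the whole Arabic Unicode block, so the broader pattern also matches characters that are not letters, such as the Arabic comma and Arabic-Indic digits:

```python
import re

narrow = re.compile("[\u0627-\u064a]")  # range from the accepted answer
broad = re.compile("[\u0600-\u06ff]")   # entire Arabic block

# U+0622, U+0623, U+0624, U+0625: the alef and waw variations mentioned above.
hamza_forms = "\u0622\u0623\u0624\u0625"

narrow_hits = narrow.findall(hamza_forms)  # none of the four fall in the narrow range
broad_hits = broad.findall(hamza_forms)    # all four fall in the broad range

# The broad range also accepts non-letters from the same block.
matches_comma = broad.match("\u060c") is not None  # Arabic comma
matches_digit = broad.match("\u0661") is not None  # Arabic-Indic digit one

print(narrow_hits, broad_hits, matches_comma, matches_digit)
```

Whether matching punctuation and digits is acceptable depends on the use case; for strict letter validation a narrower, explicit set may be preferable.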

edited May 18 '21 at 17:01

answered May 13 '21 at 02:17

Nehal


I think yours is the best one. – Zaman Feb 08 '23 at 10:05
