
I have a dataset of Arabic sentences, and I want to remove non-Arabic characters and special characters. I used this regex in Python:

text = re.sub(r'[^ء-ي0-9]',' ',text)

It works perfectly, but in some sentences (4 cases from the whole dataset) the regex also removes the Arabic words!

I read the dataset using pandas (a Python package) like this:

train = pd.read_excel('d.xlsx', encoding='utf-8')

To illustrate, I tested it on the Pythex site (screenshot not included).

What is the problem?

------------------ Edited:

The sentences in the example:

انا بحكي رجعو مبارك واعملو حفلة واحرقوها بالمعازيم ولما الاخوان يروحو يعزو احرقو العزا -- احسنلكم والله #مصر

ﺷﻔﻴﻖ ﺃﺭﺩﻭﻏﺎﻥ ﻣﺼﺮ ..ﺃﺣﻨﺍ ﻧﺒﻘﻰ ﻣﻴﻦ ﻳﺎ ﺩﺍﺩﺍ؟ #ﻣﺴﺨﺮﺓ #ﻋﺒﺚ #EgyPresident #Egypt #ﻣﻘﺎﻃﻌﻮﻥ لا يا حبيبي ما حزرت: بشار غبي بوجود بعثة أنان حاب يفضح روحه انه مجرم من هيك نفذ المجزرة لترى البعثة اجرامه بحق السورين

Minions

2 Answers


The characters that are incorrectly removed are not in the common Unicode range for Arabic (U+0621..U+064A), but are "hardcoded" as their initial, medial, and final forms.

Comparable to capitalization in Latin-based languages, but more strict than that, Arabic writing indicates both the start and end of words with a special 'flourish' form. In addition it also allows an "isolated" form (to be used when the character is not part of a full word).
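As a small illustration of these forms, the sketch below lists the four hardcoded forms of one letter using Python's standard `unicodedata` module. The choice of the letter sheen (U+FEB5..U+FEB8) is only an example picked here, not something taken from the dataset.

```python
import unicodedata

# Assumption for illustration: the letter sheen, whose four hardcoded
# forms occupy U+FEB5..U+FEB8 in Arabic Presentation Forms-B.
form_names = [unicodedata.name(chr(cp)) for cp in range(0xFEB5, 0xFEB9)]

# Print each code point next to its official Unicode name.
for cp, name in zip(range(0xFEB5, 0xFEB9), form_names):
    print(f"U+{cp:04X} {name}")
```

Each name ends in the form it encodes (isolated, final, initial, or medial), which makes it easy to spot such characters in a suspicious sentence.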

This is usually encoded in a file as 'an' Arabic character and the actual rendering in initial, medial, or final form is left to the text renderer, but since all forms also have Unicode codepoints of their own, it is also possible to "hardcode" the exact forms. That is what you encountered: a mix of these two systems.
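To make the difference between the two systems concrete, here is a minimal sketch comparing a regular letter with one of its hardcoded forms. The helper name `in_basic_arabic_range` and the sample letter are assumptions made for this example. The last step uses NFKC normalization, which is not part of the original regex approach; it is shown only to confirm that both encodings stand for the same letter.

```python
import unicodedata

def in_basic_arabic_range(ch: str) -> bool:
    """True if ch lies in U+0621..U+064A, the range the original regex keeps."""
    return "\u0621" <= ch <= "\u064a"

regular = "\u0634"    # ARABIC LETTER SHEEN (shape chosen by the renderer)
hardcoded = "\ufeb7"  # ARABIC LETTER SHEEN INITIAL FORM (shape fixed in the text)

print(in_basic_arabic_range(regular))    # kept by [^ء-ي0-9]
print(in_basic_arabic_range(hardcoded))  # replaced by [^ء-ي0-9]

# NFKC compatibility normalization folds the hardcoded form back
# to the regular letter.
folded = unicodedata.normalize("NFKC", hardcoded)
print(folded == regular)
```

Note that NFKC also rewrites other compatibility characters, so whether it is acceptable as a preprocessing step depends on the rest of the data.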

Fortunately, the Unicode ranges for the hardcoded forms are also fixed values:

Arabic Presentation Forms-A is a Unicode block encoding contextual forms and ligatures of letter variants needed for Persian, Urdu, Sindhi and Central Asian languages. The presentation forms are present only for compatibility with older standards such as codepage 864 used in DOS, and are typically used in visual and not logical order.
(https://en.wikipedia.org/wiki/Arabic_Presentation_Forms-A)

and their ranges are U+FB50..U+FDFF (Presentation Forms A) and U+FE70..U+FEFC (Presentation Forms B). If you add these ranges to your negated character class, the regex will no longer delete these texts:

[^ء-ي0-9ﭐ-﷿ﹰ-ﻼ]

Depending on your browser and/or editor, you may have problems selecting this text to copy and paste it. It may be clearer to use a string that specifies the exact characters explicitly:

[^0-9\u0621-\u064a\ufb50-\ufdff\ufe70-\ufefc]
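A minimal sketch of how this pattern could be used from Python follows. The names `NON_ARABIC` and `keep_arabic` and the sample string are made up for this example; the sample mixes a word written in hardcoded forms with a word written in regular letters, similar to the sentences in the question.

```python
import re

# Negated class: anything that is NOT an ASCII digit, a basic Arabic letter,
# or an Arabic presentation form (blocks A and B) is matched.
NON_ARABIC = re.compile(r"[^0-9\u0621-\u064a\ufb50-\ufdff\ufe70-\ufefc]")

def keep_arabic(text: str) -> str:
    """Replace every character outside the allowed ranges with a space."""
    return NON_ARABIC.sub(" ", text)

# First word uses hardcoded presentation forms, last word uses regular letters.
sample = "\ufeb7\ufed4\ufef4\ufed6 #Egypt \u0645\u0635\u0631"
cleaned = keep_arabic(sample)

# Both Arabic words survive; the hashtag is replaced by spaces.
print(cleaned.split())
```

Because each removed character becomes a space, the result can contain runs of spaces; splitting and re-joining, as in the last line, is one way to collapse them.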
Jongware

I experimented on Pythex and, with help from the question "Regular Expression Arabic characters and numbers only", found this: [\u0621-\u064A0-9], which catches almost all non-Arabic characters. For some unknown reason, it doesn't catch 'y', so you have to add it yourself: [\u0621-\u064A0-9y]. This catches all non-Arabic characters. For special characters, I'm sorry, but I found nothing better than adding them to the class explicitly: [\u0621-\u064A0-9y#\!\?\,]