Starting from this example text:
input_test = "أكتب الدر_س و إحفضه ثم إقرأ القصـــــــــــــــيـــــــــــدة"

I managed to clean this text using these functions:

import re
import string

arabic_punctuations = '''`÷×؛<>_()*&^%][ـ،/:"؟.,'{}~¦+|!”…“–ـ'''
english_punctuations = string.punctuation
punctuations_list = arabic_punctuations + english_punctuations

arabic_diacritics = re.compile("""
                             ّ    | # Tashdid
                             َ    | # Fatha
                             ً    | # Tanwin Fath
                             ُ    | # Damma
                             ٌ    | # Tanwin Damm
                             ِ    | # Kasra
                             ٍ    | # Tanwin Kasr
                             ْ    | # Sukun
                             ـ     # Tatwil/Kashida
                         """, re.VERBOSE)


def normalize_arabic(text):
    text = re.sub("[إأآا]", "ا", text)
    return text


def remove_diacritics(text):
    text = re.sub(arabic_diacritics, '', text)
    return text


def remove_punctuations(text):
    translator = str.maketrans('', '', punctuations_list)
    return text.translate(translator)


def remove_repeating_char(text):
    return re.sub(r'(.)\1+', r'\1', text)

Which gives me this text as the result:

result = "اكتب الدرس و احفضه ثم اقرا القصيدة"

Now, given this result, how can I find the word "اقرا" in the original input_test?

The input text can be in English, too. I'm thinking of using a regex, but I don't know where to start.

  • I don't think it's feasible to do what you want because the functions are causing a loss of information. – martineau Jan 29 '22 at 01:56
  • Is there any way to check whether the input text contains this word? – mohanad almowahid Jan 29 '22 at 10:14
  • Generally speaking, no. Because of the substitutions being done, there's no way to tell whether the word (presumably without any substitutions having been done to it) was in the original text or not; that's what I meant about information loss. If you're sure the word being sought would not have been affected by any of the substitutions, then you could simply search for it in the original string via `input_test.find("اقرا")`. – martineau Jan 29 '22 at 12:09
  • How about generating a list of words in which every "ا" in "اقرا" is replaced with each character in [إأآا], and then searching for each of them in input_test? – mohanad almowahid Jan 29 '22 at 13:25
  • Sounds like it might work. – martineau Jan 29 '22 at 13:44
  • Can you show me how I can generate the list with a regex? – mohanad almowahid Jan 29 '22 at 14:21
  • No, that wouldn't be appropriate for comments. Instead, I think you should ask a *new* question specifically on that topic (and show your own attempt to do it, of course). Hint: note that the *`repl`* argument (second one) that is passed to `re.sub()` can be a **function**. – martineau Jan 29 '22 at 15:01
  • Does this answer your question? "[Regex for accent insensitive replacement in python](/q/43634502/90527)", "[Regex - match a character and all its diacritic variations (aka accent-insensitive)](/q/35783135/90527)" – outis Aug 11 '22 at 22:20
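
One possible way to act on the idea from the comments, as a rough sketch rather than a definitive solution: instead of generating every spelling variant of the cleaned word, build a single tolerant regex from it and search the original text with that. The names `build_tolerant_pattern` and `find_in_original` are hypothetical helpers, not part of the question's code. The sketch assumes the cleaning steps are exactly the four functions in the question: each bare alef is allowed to match any of [إأآا], each character may be repeated, and any diacritic, tatweel, or punctuation character that the cleaning would have deleted may sit between characters.

```python
import re
import string

# Characters the question's cleaning functions delete outright (assumption:
# the eight diacritics U+064B..U+0652, the tatweel, and both punctuation lists).
ARABIC_PUNCT = '''`÷×؛<>_()*&^%][ـ،/:"؟.,'{}~¦+|!”…“–ـ'''
DELETED = sorted(set(ARABIC_PUNCT + string.punctuation))
IGNORABLE = "[\u064B-\u0652" + "".join(re.escape(c) for c in DELETED) + "]"

# normalize_arabic() maps all of these to a bare alef (U+0627).
ALEF_CLASS = "[\u0622\u0623\u0625\u0627]"


def build_tolerant_pattern(cleaned_word):
    """Compile a regex that matches spellings which clean down to cleaned_word."""
    pieces = []
    for ch in cleaned_word:
        unit = ALEF_CLASS if ch == "\u0627" else re.escape(ch)
        # One character, optionally repeated, with deleted characters in between
        # (this undoes remove_repeating_char and the deletions).
        pieces.append(f"{unit}(?:{IGNORABLE}*{unit})*")
    # Deleted characters may also appear between two different characters.
    return re.compile(f"{IGNORABLE}*".join(pieces))


def find_in_original(cleaned_word, original):
    """Return a match object locating cleaned_word in the uncleaned text, or None."""
    return build_tolerant_pattern(cleaned_word).search(original)


input_test = "أكتب الدر_س و إحفضه ثم إقرأ القصـــــــــــــــيـــــــــــدة"
match = find_in_original("اقرا", input_test)
found = match.group(0) if match else None      # the original spelling, if any
position = match.span() if match else None     # where it sits in input_test
```

This does not contradict the information-loss point made in the comments: the match only says that some spelling in the original cleans down to the searched word, and it does not check word boundaries, so it can also hit inside a longer word. The same function handles English input, since characters outside the alef class are matched literally.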

0 Answers