I have a question:
starting from this text example:
input_test = "أكتب الدر_س و إحفضه ثم إقرأ القصـــــــــــــــيـــــــــــدة"
I managed to clean this text using these functions:
arabic_punctuations = '''`÷×؛<>_()*&^%][ـ،/:"؟.,'{}~¦+|!”…“–ـ'''
english_punctuations = string.punctuation
punctuations_list = arabic_punctuations + english_punctuations
arabic_diacritics = re.compile("""
ّ | # Tashdid
َ | # Fatha
ً | # Tanwin Fath
ُ | # Damma
ٌ | # Tanwin Damm
ِ | # Kasra
ٍ | # Tanwin Kasr
ْ | # Sukun
ـ # Tatwil/Kashida
""", re.VERBOSE)
def normalize_arabic(text):
text = re.sub("[إأآا]", "ا", text)
return text
def remove_diacritics(text):
text = re.sub(arabic_diacritics, '', text)
return text
def remove_punctuations(text):
translator = str.maketrans('', '', punctuations_list)
return text.translate(translator)
def remove_repeating_char(text):
return re.sub(r'(.)\1+', r'\1', text)
Which gives me this text as the result:
result = "اكتب الدرس و احفضه ثم اقرا القصيدة"
Now if I have have this case, how can I find the word "اقرا"
in the orginal input_test
?
The input text can be in English, too. I'm thinking of regex — but I don't know from where to start…