Trouble filtering Arabic text in CSV using Python - non-Arabic symbols in output

Question

I am trying to use Python to filter verses in a religious Arabic text (the Quran) that contain certain words or characters. The program works fine and outputs a CSV file with the filtered verses when checking for some characters, but when checking for others it outputs strange non-Arabic symbols. For example, when checking for the Arabic letter "Lam" (Unicode 0x0644), the output CSV is perfect, as attached below, but when using the Arabic letter "Kaf" (Unicode 0x0643) I get a bunch of symbols like Ø³ÙÙˆØ±ÙŽØ©Ù Ø§Ù„ÙÙŽØ§ØªÙØÙŽØ©Ù. Thank you in advance for the help. My code:

import csv

mylist = []

with open("Arabic-Original.csv", "r", encoding="utf-8") as file:
    csvreader = csv.reader(file)
    for row in csvreader:
        mylist.append(row)

s = f'{chr(0x0644)}'
f = open("copiedverses.csv", "w", encoding="utf-8")
for i in range(len(mylist)):
    if s in mylist[i][0]:
        f.write(mylist[i][0] +"\n")
f.close()

Using "Lam" (Unicode 0x0644), I get something like: [screenshot]

Using "Kaf" (Unicode 0x0643), I get this: [screenshot]

The code works well for some letters but not for others. I tried multiple letters that are similar to each other, but I still can't work out why the output is Arabic for some letters and not for others. Thank you.

Is it possible this is a problem loading the file in Excel, rather than writing it in Python? e.g. https://stackoverflow.com/a/60243234/765091 — slothrop, May 22 '23 at 16:34
Please [edit] your question to improve your [mcve]. In particular, [*do not* use (sole) images of code/data/errors](https://meta.stackoverflow.com/a/285557/3439404) in your [mcve]. Copy the actual text, paste it into the question, then format it as code. — JosefZ, May 22 '23 at 19:57
Hi please include a small sample of the CSV (as text, not an image) that will allow us to see the good and the bad. I imagine we’d only need the sample to have two rows: one with Lam, and one with Kaf. — Zach Young, May 22 '23 at 20:53
Also, I imagine either: 1) those weird looking characters are already in the original file, or… 2) the original file is not actually UTF-8 encoded. I’ve read your code and I cannot see anything you’ve done in the code that could transform good text into what you see. But I’m new to Arabic and RTL scripts. — Zach Young, May 22 '23 at 20:56
Also, when writing a CSV that will be read by Excel, use `utf-8-sig` instead of `utf-8`. That writes a UTF-8 BOM (byte order mark) code point as a signature that Excel uses to read the file correctly as UTF-8. Without it, Excel will assume a localized encoding such as Windows-1252 (US-localized Windows) or Windows-1256 (Arabic-localized Windows) instead. — Mark Tolonen, May 22 '23 at 23:07
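
The suggestions in the comments above can be sketched roughly as follows. This is a minimal illustration, not a confirmed fix for the asker's file: it assumes the source CSV really is UTF-8 and that the garbling happens when the output is opened in a viewer such as Excel. The names `filter_rows` and `misread_as_cp1252` are hypothetical helpers introduced here, and the sketch writes rows with `csv.writer` rather than the raw `f.write` call used in the question.

```python
import csv
import os
import tempfile

LAM = "\u0644"  # Arabic letter Lam
KAF = "\u0643"  # Arabic letter Kaf


def filter_rows(src_path, dst_path, needle):
    """Copy the first column of every row that contains `needle`.

    The output is written as "utf-8-sig", so it starts with a UTF-8 BOM
    that Excel can use to detect the encoding (assumption: the file will
    be opened in Excel or a similar viewer).
    """
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8-sig", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if row and needle in row[0]:
                writer.writerow([row[0]])


def misread_as_cp1252(text):
    """Return what the UTF-8 bytes of `text` look like when decoded as
    Windows-1252, which is one way a viewer can produce this kind of
    garbled output. Bytes with no Windows-1252 mapping are replaced."""
    return text.encode("utf-8").decode("cp1252", errors="replace")


if __name__ == "__main__":
    # Small self-contained demo using a temporary two-row CSV.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "sample.csv")
        dst = os.path.join(tmp, "filtered.csv")
        with open(src, "w", encoding="utf-8", newline="") as fh:
            fh.write("\u0643\u062a\u0627\u0628\n\u0642\u0644\u0645\n")
        filter_rows(src, dst, KAF)
        with open(dst, "rb") as fh:
            print(fh.read())  # raw bytes, beginning with the BOM
    print(ascii(misread_as_cp1252(KAF)))
```

Reading the output back as raw bytes, as the demo does, is one way to tell a bad file apart from a bad viewer: if the bytes decode cleanly as UTF-8 in Python, the symbols are being introduced when the file is displayed, not when it is written.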

0 Answers