I scraped an Arabic-language website to use the text as training data for NER. Now I am trying to write each word with its associated tag (whether or not it is a named entity) to a CSV file, but in the CSV file the Arabic words are displayed as characters like these: وصلى. How can I resolve this problem? I want the Arabic words to appear in Arabic script in my CSV file.

This is part of my code:

import csv

# newline='' lets the csv module control line endings itself
with open(filename, 'w', encoding='utf-8', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(fields)
    csvwriter.writerows(rows)
John Kugelman
Delaram R
  • Is "وصلى" really UTF-8? Are you facing the issue while scraping or while saving the file? I believe we can handle those characters while scraping rather than while writing to the file. – imxitiz Aug 09 '21 at 00:35
  • It's probably a viewer issue. If you are reading the CSV with Excel, write it with `encoding='utf-8-sig'`. Excel assumes a localized encoding unless there is a UTF-8-encoded byte order mark (BOM) code point at the start of the file. `وصلى` is Arabic text encoded in UTF-8 but decoded as Windows-1252. – Mark Tolonen Aug 09 '21 at 00:39
  • Thanks! I used encoding="utf-8-sig" and that just solved the problem :) – Delaram R Aug 09 '21 at 00:42
  • @Xitiz How do you think OP is looking at the generated CSV file to see the problem :^) – Mark Tolonen Aug 09 '21 at 00:43
  • @MarkTolonen Don't know. :P – imxitiz Aug 09 '21 at 00:44
