I was trying to save my dataset in a CSV file with the following script:

with open(data_path+'Furough.csv', 'w',encoding="utf-8") as f0:
    df = pd.DataFrame(columns=['title','poem','year'])
    for f in onlyfiles:
        poem=[]
        title=""
        year=0
        with open(mypath+f,"r",encoding="utf-8") as f1:
            for line in f1:
                if line.__contains__("TIMESTAMP"):
                    year=int(line[12:15])
                    continue
                if line.__contains__('TITLE'):
                    title=line[7:]
                if line!="":
                    poem.append(line)
            df = df.append({
                            'title': title,
                            'poem':poem,
                            'year': int(float(year))
                            }, ignore_index=True)
            df.to_csv(f0, index=False,encoding='utf-8-sig')

But the result is confusing: the CSV file contains unknown characters instead of the Farsi text. Can anyone help me?

I want to write all of these files into a single CSV. Here is an example of what one of them contains and what I want to write:

[V_START] بر پرده‌های درهم امیال سرکشم [HEM]
نقش عجیب چهرۀ یک ناشناس بود [V_END]
[V_START] نقشی ز چهره‌ای که چو می‌جستمش به شوق [HEM]
پیوسته می‌رمید و بمن رخ نمی‌نمود [V_END]

[V_START] یک شب نگاه خستۀ مردی به روی من [HEM]
لغزید و سست گشت و همان ‌جا خموش ماند [V_END]
[V_START] تا خواستم که بگسلم این رشتۀ نگاه [HEM]
قلبم تپید و باز مرا سوی او کشاند [V_END]

But the result shows unknown characters instead of the Farsi text.

Zahra Hosseini
  • Does your csv-editor normally display Farsi correctly? – Patrick Artner Jul 09 '21 at 08:29
  • I'm not sure, but there might be a problem with the output file name you passed to the to_csv function. I tested similar code, and there was a problem with writing to an existing file. Try a new file and report back, please. Suggestion: adding a sample input that matches the output you showed increases the chance of getting an answer sooner. – Amin Heydari Alashti Jul 09 '21 at 08:45
  • I edited my question and added more details. Thank you! I tried a new file and got the same problem :( – Zahra Hosseini Jul 09 '21 at 08:54
  • What are you using to open the csv file at the end @zahraHosseini? – Cimbali Jul 09 '21 at 08:55
  • Can you try changing each "utf-8" to "utf-8-sig"? I was going through this: https://stackoverflow.com/questions/34905380/unable-to-save-arabic-decoded-unicode-to-csv-file-using-python and it might help. – Zaid Al Shattle Jul 09 '21 at 08:57
  • Microsoft Excel! Does it matter? @Cimbali – Zahra Hosseini Jul 09 '21 at 08:57
  • Edit to my previous message: I think it is mostly the open() call used for writing that needs to be in utf-8-sig – Zaid Al Shattle Jul 09 '21 at 09:00

2 Answers


It’s likely that your file is correct and Excel is opening it with another encoding.

Inserting a UTF-8 BOM at the start of the file may force Excel to recognize the CSV as UTF-8:

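with open(data_path+'Furough.csv', 'w',encoding="utf-8") as f0:
    # f0 is opened in text mode, so the BOM is written as a character;
    # codecs.BOM_UTF8 is a bytes object and would raise a TypeError here.
    f0.write('\ufeff')

    # rest of your code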

Otherwise, see this Microsoft help page: "How to open UTF-8 CSV file in Excel without mis-conversion?"
It basically comes down to going through the “Get Data From Text” dialog, which allows you to specify the encoding.
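
One way to check whether a BOM actually ended up in the file is to look at its first three bytes. The following is only a minimal sketch: the temporary file `bom_demo.csv` and the helper `starts_with_utf8_bom` are hypothetical names made up for this illustration, not part of the original script. It writes the BOM as the text character `"\ufeff"`, because a file opened in text mode accepts strings rather than bytes.

```python
import codecs
import os
import tempfile


def starts_with_utf8_bom(path):
    """Return True if the file at `path` begins with the UTF-8 BOM bytes."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


# Hypothetical file name, used only for this illustration.
demo_path = os.path.join(tempfile.mkdtemp(), "bom_demo.csv")

with open(demo_path, "w", encoding="utf-8", newline="") as f0:
    f0.write("\ufeff")          # BOM as a character; encoded as EF BB BF
    f0.write("title,poem\n")    # header row
    f0.write("عنوان,شعر\n")     # one row of Farsi sample text

has_bom = starts_with_utf8_bom(demo_path)
print(has_bom)
```

If the check reports a BOM and Excel still shows the wrong characters, the import dialog route described above is probably the next thing to try.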

Cimbali

To add to Cimbali's answer, another way to add a UTF-8 BOM is to open the file with the encoding "utf-8-sig" instead of "utf-8", which writes the BOM for you automatically.

Further information is in this question: Unable to Save Arabic Decoded Unicode to CSV File Using Python
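
As a rough sketch of the "utf-8-sig" approach, the example below uses the standard-library `csv` module instead of pandas so that it runs on its own. The output path `utf8_sig_demo.csv`, the sample title, and the placeholder year `0` are assumptions made for illustration; only the poem line comes from the question. With pandas, the analogous step would presumably be to pass a file path (rather than an already opened text handle) together with `encoding="utf-8-sig"` to `to_csv`.

```python
import csv
import os
import tempfile

# Illustrative row; the title and year are placeholders, not real data.
rows = [
    {"title": "نمونه", "poem": "بر پرده‌های درهم امیال سرکشم", "year": 0},
]

# Hypothetical output path, used only for this sketch.
out_path = os.path.join(tempfile.mkdtemp(), "utf8_sig_demo.csv")

# "utf-8-sig" makes Python emit the BOM once, at the start of the file.
with open(out_path, "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "poem", "year"])
    writer.writeheader()
    writer.writerows(rows)

# Inspect the raw bytes to confirm the BOM is present.
with open(out_path, "rb") as f:
    raw = f.read()

# Reading with "utf-8-sig" strips the BOM again before parsing.
with open(out_path, encoding="utf-8-sig", newline="") as f:
    read_back = list(csv.DictReader(f))

print(raw[:3])
print(read_back[0]["poem"])
```

Writing the file once, after all rows have been collected, also avoids repeating the header for every input file.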

Zaid Al Shattle