
I am trying to process a large csv file that contains Greek words. A sample of the data:

[Image: three-column table with Greek (Unicode) words in the first and third columns]

Or:

FileID  Word Num    Normalized  Normalized  POS Lemma   MorphoFeats w/n/naw Element
susi0011    2   Θεόδορος    Θεόδορος    PROPN   Θεύδωρος    Case=Nom|Gender=Masc|Number=Sing    naw <orig xml:id="susi0011-2" xml:lang="grc">Θεόδορος</orig>
susi0012    2   Σιμονίου    Σιμονίου    PROPN   Σιμονύος    Case=Gen|Gender=Masc|Number=Sing    naw <orig xml:id="susi0012-2" xml:lang="grc">Σιμονίου</orig>
susi0012    3   πρεσβίτερος πρεσβίτερος ADJ πρέσβυς Case=Nom|Degree=Cmp|Gender=Masc|Number=Sing naw <orig xml:id="susi0012-3" xml:lang="grc">πρεσβίτερος</orig>

I read it in with the simple:

import pandas as pd
df = pd.read_csv('myfile.csv', encoding='utf-8')

I also tried:

with open('myfile.csv', encoding='utf-8') as f:
    df = pd.read_csv(f)

Yet when I call df.head() (in both cases), my Greek words come out as ???. Thinking this might be only a display issue, I also tried writing the dataframe back out as a CSV (both with and without an encoding parameter), but the Greek was lost there too. The output looks something like this:

FileID  Word Num    Normalized  Normalized.1    POS Lemma   MorphoFeats w/n/naw Element
0   susi0011    2   ????????    ????????    PROPN   ????????    Case=Nom|Gender=Masc|Number=Sing    naw <orig xml:id="susi0011-2" xml:lang="grc">?????...
1   susi0012    2   ????????    ????????    PROPN   ????????    Case=Gen|Gender=Masc|Number=Sing    naw <orig xml:id="susi0012-2" xml:lang="grc">?????...
2   susi0012    3   ??????????? ??????????? ADJ ??????? Case=Nom|Degree=Cmp|Gender=Masc|Number=Sing naw <orig xml:id="susi0012-3" xml:lang="grc">?????...

Any suggestions?

Mark Tolonen
M. Satlow
  • Please [edit] your question and improve your [mcve]. Share output from `df.head()` (copy & paste plain text). – JosefZ Mar 11 '22 at 18:52
  • Can you try this? Just to make sure that your problem is based on encoding https://stackoverflow.com/a/66062975/5987487 – Mahbubur Rahman Mar 11 '22 at 19:13
  • Mahbubur, that is an interesting routine! Ran it and it did not render correctly in any encoding - mostly ?, but some of the encodings rendered in other characters. – M. Satlow Mar 11 '22 at 19:32
  • Please share output from `df.head()` (copy & paste **plain text**). – JosefZ Mar 11 '22 at 20:37
  • Ok, I figured out how to do that, and added it to the post. Sorry - I'm pretty new at this. – M. Satlow Mar 11 '22 at 21:20
  • This seems like a display issue with whatever environment you are using. If the file wasn't encoded in UTF-8 the `read_csv` would fail. How are you reading the file to copy/paste the correct text vs. the environment used to read and display the output of reading the CSV with Python? – Mark Tolonen Mar 11 '22 at 21:42
  • I was wondering about that, but when I write to csv the output file still doesn’t render like the input file, so I’m confused. – M. Satlow Mar 11 '22 at 21:57
  • Please share `list(df['Lemma'])[0].encode()` (should be the same as `'Θεύδωρος'.encode()` which outputs `b'\xce\x98\xce\xb5\xcf\x8d\xce\xb4\xcf\x89\xcf\x81\xce\xbf\xcf\x82'`). Otherwise, share `open('myfile.csv','rb').read()[0:200]`. – JosefZ Mar 12 '22 at 10:07
  • They are not the same. list(df['Lemma'])[0].encode() outputs: b'????????'. The output of open('myfile.csv','rb').read()[0:200] is: b'FileID,Word Num,Normalized,Normalized,POS,Lemma,MorphoFeats,w/n/naw,Element\r\nsusi0011,2,????????,????????,PROPN,????????,Case=Nom|Gender=Masc|Number=Sing,naw," – M. Satlow Mar 13 '22 at 15:29
  • So your file was created in a wrong way. Export from Excel table? Or download from somewhere? – JosefZ Mar 13 '22 at 17:22
  • I think I see the problem now. When I open and manipulate my Unicode csv file in Excel 2016, there is a problem exporting it out into a Unicode csv. I have tried a lot of different approaches now and can’t figure out how to get around this, so I’ll just move the original csv file directly into pandas and manipulate it there. Thank you for the help. – M. Satlow Mar 14 '22 at 20:38
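
For anyone landing here with the same symptom, the byte-level check suggested in the comments can be sketched roughly as below. This is only an illustration: the helper name `diagnose`, its return labels, and the "three or more `?` in a row" heuristic are made up for this sketch, and the `cp1252` line assumes a lossy save to a legacy code page (the thread does not say which encoding Excel actually used).

```python
import re

# Greek and Greek Extended Unicode blocks.
GREEK = re.compile(r'[\u0370-\u03FF\u1F00-\u1FFF]')
# Runs of literal '?' (byte 0x3F); the threshold of 3 is an arbitrary heuristic.
LOST = re.compile(rb'\?{3,}')


def diagnose(raw: bytes) -> str:
    """Classify raw file bytes (hypothetical helper, not a pandas API)."""
    try:
        text = raw.decode('utf-8')
    except UnicodeDecodeError:
        return 'not-utf8'        # some other encoding; the data may be recoverable
    if GREEK.search(text):
        return 'greek'           # bytes are fine, so suspect the display
    if LOST.search(raw):
        return 'question-marks'  # text was replaced before Python ever read it
    return 'no-greek'


good = 'susi0011,2,Θεόδορος'.encode('utf-8')
bad = b'susi0011,2,????????'
print(diagnose(good))
print(diagnose(bad))

# Assumed mechanism: encoding to a code page without Greek letters,
# with replacement on error, turns every character into a literal '?'.
lossy = 'Θεόδορος'.encode('cp1252', errors='replace')
print(lossy)
```

To try it on a real file, pass in something like `open('myfile.csv', 'rb').read()[:2000]`. A result of `question-marks` would match what the last comments describe: the file on disk already contains `?` bytes, so no `encoding=` argument to `read_csv` can bring the Greek back, and the original CSV has to be used instead.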

0 Answers