I am trying to process a large CSV file that contains Greek words. A sample of the data:
FileID Word Num Normalized Normalized POS Lemma MorphoFeats w/n/naw Element
susi0011 2 Θεόδορος Θεόδορος PROPN Θεύδωρος Case=Nom|Gender=Masc|Number=Sing naw <orig xml:id="susi0011-2" xml:lang="grc">Θεόδορος</orig>
susi0012 2 Σιμονίου Σιμονίου PROPN Σιμονύος Case=Gen|Gender=Masc|Number=Sing naw <orig xml:id="susi0012-2" xml:lang="grc">Σιμονίου</orig>
susi0012 3 πρεσβίτερος πρεσβίτερος ADJ πρέσβυς Case=Nom|Degree=Cmp|Gender=Masc|Number=Sing naw <orig xml:id="susi0012-3" xml:lang="grc">πρεσβίτερος</orig>
I read it in with the simple:
import pandas as pd
df = pd.read_csv('myfile.csv', encoding='utf-8')
I also tried:
with open('myfile.csv', encoding='utf-8') as f:
    df = pd.read_csv(f)
Yet when I call df.head() (in both cases), my Greek words come out as ???. Thinking that this might be a display issue rather than a reading issue, I also tried writing the DataFrame back out as a CSV (both with and without an encoding parameter), but the Greek was lost there too. The output looks something like this:
FileID Word Num Normalized Normalized.1 POS Lemma MorphoFeats w/n/naw Element
0 susi0011 2 ???????? ???????? PROPN ???????? Case=Nom|Gender=Masc|Number=Sing naw <orig xml:id="susi0011-2" xml:lang="grc">?????...
1 susi0012 2 ???????? ???????? PROPN ???????? Case=Gen|Gender=Masc|Number=Sing naw <orig xml:id="susi0012-2" xml:lang="grc">?????...
2 susi0012 3 ??????????? ??????????? ADJ ??????? Case=Nom|Degree=Cmp|Gender=Masc|Number=Sing naw <orig xml:id="susi0012-3" xml:lang="grc">?????...
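To narrow down whether the Greek is lost in the file itself, during reading, or only on display, a check along the following lines might help. This is only a sketch: `count_bytes` and `code_points` are hypothetical helper names, the byte string is a stand-in for the first bytes of myfile.csv, and the Unicode ranges assume the words fall in the Greek and Greek Extended blocks.

```python
def count_bytes(raw: bytes) -> dict:
    """Count literal '?' bytes and decodable Greek characters in raw CSV bytes."""
    text = raw.decode("utf-8", errors="replace")
    greek = sum(
        1 for ch in text
        # Assumed ranges: Greek and Coptic, plus Greek Extended
        if "\u0370" <= ch <= "\u03ff" or "\u1f00" <= ch <= "\u1fff"
    )
    return {"question_marks": raw.count(b"?"), "greek_chars": greek}


def code_points(value: str) -> list:
    """Return the code points of a string, e.g. for a single DataFrame cell."""
    return [f"U+{ord(ch):04X}" for ch in value]


# Stand-in for the real file; with the actual data this would be
# open('myfile.csv', 'rb').read(2000)
sample = "FileID,Normalized\nsusi0011,Θεόδορος\n".encode("utf-8")
counts = count_bytes(sample)
print(counts)

# Stand-in for a value taken from the DataFrame,
# e.g. df['Normalized'].iloc[0]
points = code_points("Θεόδορος")
print(points)
```

If the raw bytes already contain literal question marks (0x3F) where the Greek should be, the characters were presumably replaced when the file was saved, and no encoding argument to read_csv can bring them back. If a cell instead reports code points such as U+0398, the data is intact in memory and only the console or viewer is failing to render it.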
Any suggestions?