1

I am dealing with a dataframe that has Greek characters. They appear like that:

enter image description here

The data are here:

toy.to_json()

'{"a_a":{"0":49.0,"1":50.0,"2":52.0,"3":53.0,"4":54.0},"grade":{"0":3.0,"1":5.0,"2":4.0,"3":5.0,"4":4.0},"sex":{"0":"\\u00c1\\u00e3\\u00fc\\u00f1\\u00e9","1":"\\u00c1\\u00e3\\u00fc\\u00f1\\u00e9","2":"\\u00c1\\u00e3\\u00fc\\u00f1\\u00e9","3":"\\u00c1\\u00e3\\u00fc\\u00f1\\u00e9","4":"\\u00c1\\u00e3\\u00fc\\u00f1\\u00e9"},"age":{"0":122.0,"1":125.0,"2":119.0,"3":122.0,"4":127.0},"fath_job":{"0":2.0,"1":2.0,"2":2.0,"3":2.0,"4":2.0},"phscs":{"0":49.0,"1":73.0,"2":61.0,"3":75.0,"4":59.0},"pcc":{"0":10.0,"1":26.0,"2":19.0,"3":28.0,"4":23.0},"pcg":{"0":21.0,"1":28.0,"2":20.0,"3":25.0,"4":19.0},"tasc":{"0":17.0,"1":5.0,"2":17.0,"3":8.0,"4":11.0},"class":{"0":0.0,"1":0.0,"2":0.0,"3":0.0,"4":0.0},"grade3":{"0":"\\u00ec\\u00dd\\u00f4\\u00f1\\u00e9\\u00ef\\u00f2","1":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00fc\\u00f2","2":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00fc\\u00f2","3":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00fc\\u00f2","4":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00fc\\u00f2"},"pcc3":{"0":"\\u00f7\\u00e1\\u00ec\\u00e7\\u00eb\\u00de","1":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00de","2":"\\u00ec\\u00dd\\u00f4\\u00f1\\u00e9\\u00e1","3":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00de","4":"\\u00ec\\u00dd\\u00f4\\u00f1\\u00e9\\u00e1"},"tasc3":{"0":3.0,"1":1.0,"2":3.0,"3":2.0,"4":2.0},"pcg3":{"0":"\\u00ec\\u00dd\\u00f4\\u00f1\\u00e9\\u00e1","1":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00de","2":"\\u00ec\\u00dd\\u00f4\\u00f1\\u00e9\\u00e1","3":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00de","4":"\\u00ec\\u00dd\\u00f4\\u00f1\\u00e9\\u00e1"},"phscs3":{"0":"\\u00f7\\u00e1\\u00ec\\u00e7\\u00eb\\u00de","1":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00de","2":"\\u00ec\\u00dd\\u00f4\\u00f1\\u00e9\\u00e1","3":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00de","4":"\\u00ec\\u00dd\\u00f4\\u00f1\\u00e9\\u00e1"}}'

I tried to import the file with encoding = 'utf_8' but it did not work.

Here are some other approaches I tried:

toy.to_csv('toy.csv', index = False)

import chardet

rawdata = open('toy.csv', 'rb').read()

result = chardet.detect(rawdata)

charenc = result['encoding']

pd.read_csv('toy.csv', encoding = charenc)

enter image description here

pd.read_csv('toy.csv', encoding = 'cp737')

enter image description here

halfer
  • 19,824
  • 17
  • 99
  • 186
user8270077
  • 4,621
  • 17
  • 75
  • 140
  • how about `cp737` as encoding? if it doesnot work, try [this](https://stackoverflow.com/questions/54133455/importing-csv-using-pd-read-csv-invalid-start-byte-error/54134734#54134734) solution – anky Jun 08 '19 at 16:47
  • So `pd.read_csv('toy.csv', encoding = 'cp737')` seems to work, what's the issue? – cs95 Jun 08 '19 at 17:07
  • @cs95: No it does not. What you see is not the Greek text but some incomprehensible combination of characters. – user8270077 Jun 09 '19 at 03:27

2 Answers2

1

Try something like this:

pd.read_csv('toy.csv', encoding = 'iso8859_7')

It solved the problem for me, I hope it's the same for you.

buddemat
  • 4,552
  • 14
  • 29
  • 49
Kostas
  • 11
  • 2
0

On Mac (macOS Catalina 10.15.7) this works:

gdf = gpd.read_file("./../data/processed/landmarks.shp", encoding="mac_greek")

Apparently, you have to find the right encoding for your system. A list of python encodings can be found in the documentation. However, it is still not perfect. I still have some random characters. I do not know it that is due to the data source or because the list refers to python 2.

Greek encoded geo pandas data