
I am stuck on a small problem and cannot seem to get past it. I have a .csv file that is pipe-separated. Let's assume that the first few lines of my file (simply named 'data.csv') are the following:

heading_a|heading_b|heading_c|heading_d
42|FOO|BAR|2017-09-30
0|AAB|DC|2017-09-30
101|BBA|BC|2017-09-30
...

To read this data into a pandas DataFrame, I used the following command:

import pandas as pd

my_df = pd.read_table("data.csv", sep='|', index_col=False,
                      names=["heading_a","heading_b","heading_c","heading_d"])

And for some strange reason I am getting the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa2 in position 19: invalid start byte

Based on the answer to the question "UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte", I assume that I need to encode my data as UTF-8 before reading it into a data frame, but I cannot figure out how to do so. Any pointers on how to do that, or a better way to fix this error, would be greatly appreciated :)

DGav
  • Try passing `encoding='latin'` to `pd.read_table`. – ayhan Jan 08 '18 at 15:43
  • @ayhan - Yes that worked! Very interesting, thank you. I am going to look further into why this worked. – DGav Jan 08 '18 at 15:48
  • The characters in the file are encoded differently than pandas' default (`utf-8`). I encounter this regularly with old files having Turkish characters. – ayhan Jan 08 '18 at 16:00
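
The suggestion in the comments can be sketched as follows. This is a minimal illustration, not the asker's actual file: the sample row containing byte 0xA2 and the temporary path are made up for the example, and `pd.read_csv` with `sep="|"` stands in for `pd.read_table`. It assumes the file really is Latin-1 encoded, which the thread does not confirm.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data: the layout from the question, plus one
# character ("\xa2", the cent sign) that Latin-1 stores as byte 0xA2.
rows = (
    "heading_a|heading_b|heading_c|heading_d\n"
    "42|FOO\xa2|BAR|2017-09-30\n"
    "0|AAB|DC|2017-09-30\n"
)

# Write the sample as Latin-1 bytes to a temporary file.
path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "wb") as f:
    f.write(rows.encode("latin-1"))

# With the default encoding (UTF-8), the lone 0xA2 byte cannot be decoded.
try:
    pd.read_csv(path, sep="|")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Naming the encoding explicitly lets pandas decode the file.
# The header row in the file supplies the column names.
my_df = pd.read_csv(path, sep="|", encoding="latin-1")
```

Note that Latin-1 maps every possible byte to some character, so this call never raises a decoding error even when the file was actually written in a different single-byte encoding. If the decoded text looks wrong (for example with Turkish characters), a different encoding name may be the right one.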
