What am I doing wrong when I am trying to read my csv file in python?

Question

I am using Spyder from the Anaconda bundle on a MacBook and keep getting this error when I run the commands below:

import pandas as pd 

file = '/Users/JDMacBook/.spyder-py3/US Mass Shootings.csv'
df = pd.read_csv(file)
print(df.head())  # head is a method; without the parentheses it is never called

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 87: invalid continuation byte

Sorry if this is a duplicate. I Googled, YouTubed, and even searched Stack Overflow for this question, but I can't seem to figure it out. Can you please help this newbie?

Welcome to Stack Overflow! You can [take the tour](http://stackoverflow.com/tour) first and learn [How to Ask a good question](http://stackoverflow.com/help/how-to-ask) and create a [Minimal, Complete, and Verifiable](http://stackoverflow.com/help/mcve) example. That makes it easier for us to help you. — Stephen Rauch, Jan 15 '18 at 05:24
It seems to be saying that the CSV file is not formatted correctly. — Barmar, Jan 15 '18 at 05:25
Do you know the encoding of the file? You should open the file with that encoding, i.e: `pd.read_csv(file, encoding="utf-8")` — umutto, Jan 15 '18 at 05:31
If you can show the snippet of the file you are having problems with, we can perhaps help you figure out what encoding the file is actually using. There is a brief guideline in the [Stack Overflow `character-encoding` tag info page](/tags/character-encoding/info). Using a legacy 8-bit encoding like cp-1252 will certainly decode the file into *something* without any *explicit* errors, but if the encoding isn't the correct one, you are basically producing garbage. — tripleee, Jan 15 '18 at 05:54
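To find the snippet worth showing, it can help to look at the raw bytes around the position named in the error. The following is only a minimal sketch: `describe_decode_failure` is a hypothetical helper, and `sample` is made-up data standing in for the real file contents.

```python
def describe_decode_failure(raw: bytes, encoding: str = "utf-8", context: int = 20):
    """Return None if raw decodes cleanly, else details about the first bad byte."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start and err.end delimit the bytes the codec rejected.
        start = max(err.start - context, 0)
        end = min(err.end + context, len(raw))
        return {
            "offset": err.start,
            "byte": hex(raw[err.start]),
            "reason": err.reason,
            "context": raw[start:end],  # raw bytes surrounding the failure
        }
    return None


# Made-up sample bytes (an assumption, not the asker's data).
sample = b"id,country\n1,ESPA\xd1A\n"
info = describe_decode_failure(sample)
print(info)
```

For a real file, the bytes would come from something like `open(path, "rb").read()`. The `context` bytes are what is useful to paste into the question.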

score 2 · Answer 1 · answered Jan 15 '18 at 06:07

If the file you are trying to process is https://github.com/bruno78/python-capstone-project/blob/master/mj-1982-2016-US-mass-shootings.csv there is a spurious ghost byte on line 55 which needs to be removed in order for the file to be properly decoded.

Line 55 describes the Trolley Square shooting so there is a third-party source (viz. Wikipedia) where you can verify the correct orthography of the shooter's name.
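If it is unclear which line is affected, one way to check is to decode the file line by line and report the lines that fail. This is a rough sketch under assumptions: `undecodable_lines` is a hypothetical helper, the sample bytes are invented, and physical line numbers may differ from CSV record numbers when quoted fields contain newlines.

```python
def undecodable_lines(raw: bytes, encoding: str = "utf-8"):
    """Return (line_number, raw_line) pairs for lines that fail to decode."""
    bad = []
    # Splitting on b"\n" is safe for UTF-8: 0x0a never occurs inside a multi-byte sequence.
    for number, line in enumerate(raw.split(b"\n"), start=1):
        try:
            line.decode(encoding)
        except UnicodeDecodeError:
            bad.append((number, line))
    return bad


# Invented sample data (an assumption), with one bad byte on line 3.
sample_rows = b"id,country\n1,France\n2,ESPA\xd1A\n3,Peru\n"
bad_lines = undecodable_lines(sample_rows)
print(bad_lines)
```

With the real file, `raw` would be read in binary mode, and the reported lines can then be compared against a trusted source before editing anything.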


tripleee

The ghost byte, or rather, ghost code point is not 0xd1 though, and it's not in fact invalid Unicode anyway, so not really sure what's up with this. See also https://stackoverflow.com/questions/46180610/python-3-unicodedecodeerror-how-do-i-debug-unicodedecodeerror – tripleee Jan 16 '18 at 10:06
Looking at it now, this is beginning to look suspiciously like https://stackoverflow.com/questions/48259515/decoding-issue-while-parsing-json-python – tripleee Jan 16 '18 at 21:07

score -1 · Answer 2 · answered Jan 15 '18 at 05:49

import pandas as pd
file = '/Users/JDMacBook/.spyder-py3/US Mass Shootings.csv'
data = pd.read_csv(file, encoding='utf-8')

Try this.

This is because the encoding of the file is utf-8. The default encoding is ascii.


Lennon liu

No, the error specifically says that the file *isn't* valid UTF-8. – tripleee Jan 15 '18 at 05:52
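To illustrate the point, here is a small sketch using made-up bytes (not the asker's file). A byte 0xd1 followed by an ASCII letter is rejected by the UTF-8 codec, while single-byte legacy codecs accept it and simply disagree about what it means, so a clean decode does not show that the encoding is right.

```python
# Made-up bytes containing 0xd1 followed by an ASCII letter (an assumption).
data = b"ESPA\xd1A"

try:
    data.decode("utf-8")
    utf8_reason = None
except UnicodeDecodeError as err:
    utf8_reason = err.reason  # why the UTF-8 codec rejected the byte

# Legacy 8-bit codecs map every byte to some character, so they never fail here.
as_latin1 = data.decode("latin-1")
as_cp1251 = data.decode("cp1251")

print(utf8_reason)
print(as_latin1, as_cp1251)
```

The two legacy decodings give different text for the same bytes, which is why the actual encoding has to be established from the data rather than guessed. Once known, it can be passed as the `encoding` argument to `pd.read_csv`.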