-1

I am using Spyder through the Anaconda bundle on a MacBook and keep getting this error when I run the commands below:

import pandas as pd 

file = '/Users/JDMacBook/.spyder-py3/US Mass Shootings.csv'
df = pd.read_csv(file)
print(df.head())

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 87: invalid continuation byte

Sorry if this is a duplicate. I searched Google, YouTube, and Stack Overflow extensively but can't seem to figure this out. Can you please help this newbie?

tripleee
  • Welcome to Stack Overflow! You can [take the tour](http://stackoverflow.com/tour) first and learn [How to Ask a good question](http://stackoverflow.com/help/how-to-ask) and create a [Minimal, Complete, and Verifiable](http://stackoverflow.com/help/mcve) example. That makes it easier for us to help you. – Stephen Rauch Jan 15 '18 at 05:24
  • It seems to be saying that the CSV file is not formatted correctly. – Barmar Jan 15 '18 at 05:25
  • Do you know the encoding of the file? You should open the file with that encoding, i.e: `pd.read_csv(file, encoding="utf-8")` – umutto Jan 15 '18 at 05:31
  • If you can show the snippet of the file you are having problems with, we can perhaps help you figure out what encoding the file is actually using. There is a brief guideline in the [Stack Overflow `character-encoding` tag info page](/tags/character-encoding/info). Using a legacy 8-bit encoding like cp-1252 will certainly decode the file into *something* without any *explicit* errors, but if the encoding isn't the correct one, you are basically producing garbage. – tripleee Jan 15 '18 at 05:54
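To illustrate the point made in the comments, here is a minimal sketch of how the same byte decodes under different codecs. The `sample` bytes and the `decode_with_each` helper are made up for illustration and are not taken from the actual file; the candidate encodings are only guesses at what a CSV saved on a Mac or Windows machine might use.

```python
def decode_with_each(raw, encodings=("utf-8", "cp1252", "mac_roman")):
    """Decode the same bytes with several codecs; None marks a failure."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Hypothetical sample: 0xd1 followed by an ordinary ASCII space.
# In UTF-8, 0xd1 starts a two-byte sequence, so the next byte must be
# a continuation byte (0x80-0xbf); a space is not, hence the error.
sample = b"\xd1 "
results = decode_with_each(sample)
for name, text in results.items():
    print(name, repr(text))
```

UTF-8 rejects the bytes, while the two 8-bit codecs both succeed but yield different characters. A decode that raises no error is therefore not proof that the encoding is the right one.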

2 Answers

2

If the file you are trying to process is https://github.com/bruno78/python-capstone-project/blob/master/mj-1982-2016-US-mass-shootings.csv there is a spurious ghost byte on line 55 which needs to be removed in order for the file to be properly decoded.

Line 55 describes the Trolley Square shooting so there is a third-party source (viz. Wikipedia) where you can verify the correct orthography of the shooter's name.
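If you are not sure which line is affected, one possible way to find it is to read the file in binary mode and try to decode each line separately. This is only a sketch: `find_undecodable_lines` is a hypothetical helper, and it assumes the file uses `\n` line endings and is meant to be UTF-8.

```python
def find_undecodable_lines(path, encoding="utf-8"):
    """Return (line_number, offset_in_line, offending_bytes) for each bad line."""
    problems = []
    with open(path, "rb") as handle:
        # Iterating a binary file yields raw lines split on b"\n".
        for line_number, raw_line in enumerate(handle, start=1):
            try:
                raw_line.decode(encoding)
            except UnicodeDecodeError as exc:
                # exc.start and exc.end delimit the bytes that failed to decode.
                bad_bytes = raw_line[exc.start:exc.end]
                problems.append((line_number, exc.start, bad_bytes))
    return problems


# Example usage (the path is the one from the question):
# for line_number, offset, bad in find_undecodable_lines(
#         '/Users/JDMacBook/.spyder-py3/US Mass Shootings.csv'):
#     print(line_number, offset, bad)
```

Only the first failing byte sequence on each line is reported. Once you know the line number, you can open the file in an editor and compare the text against a trusted source.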

tripleee
  • The ghost byte, or rather, ghost code point is not 0xd1 though, and it's not in fact invalid Unicode anyway, so not really sure what's up with this. See also https://stackoverflow.com/questions/46180610/python-3-unicodedecodeerror-how-do-i-debug-unicodedecodeerror – tripleee Jan 16 '18 at 10:06
  • Looking at it now, this is beginning to look suspiciously like https://stackoverflow.com/questions/48259515/decoding-issue-while-parsing-json-python – tripleee Jan 16 '18 at 21:07
-1
import pandas as pd
file = '/Users/JDMacBook/.spyder-py3/US Mass Shootings.csv'
data = pd.read_csv(file, encoding='utf-8')

Try this.

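This passes the file's encoding to `read_csv` explicitly. Note that `read_csv` already defaults to UTF-8 in Python 3, which is why the error message mentions the 'utf-8' codec. If the same error still appears, the file is not valid UTF-8 and you need to pass whatever encoding it was actually saved in.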
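If you do not know the encoding, a rough sketch like the one below tries a few candidates in order and reports the first one that decodes the whole file. `first_working_encoding` and its candidate list are assumptions made for illustration, not part of pandas. Also keep in mind that `latin-1` accepts any byte, so a successful decode does not prove the encoding is correct.

```python
def first_working_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes the whole file."""
    with open(path, "rb") as handle:
        raw = handle.read()
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this codec cannot decode the file, try the next one
        return name
    raise ValueError("none of the candidate encodings could decode the file")


# Example usage with pandas (path taken from the question):
# import pandas as pd
# file = '/Users/JDMacBook/.spyder-py3/US Mass Shootings.csv'
# data = pd.read_csv(file, encoding=first_working_encoding(file))
```

After loading, check a few rows that contain non-ASCII characters to confirm they look right.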

Lennon liu