I have been trying to figure out how to get a UTF-8 CSV that I downloaded into a DataFrame. So far I have tried

import pandas as pd
df = pd.read_csv('myfile.csv', encoding='utf8')

and it gives me garbage. I am having success reading it in with

import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

as suggested in this post

Reading a UTF8 CSV file with Python

but that only prints this gigantic file row by row, and I cannot get it into a DataFrame.

I'm using Python 3. Thanks for helping!

My specific error output is

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 3: invalid start byte

The file I am trying to work with is one of the YEARLY CSV files downloaded from this link (not WEEKLY; I am not sure whether the weekly files use a different format)

https://exporter.nih.gov/ExPORTER_Catalog.aspx?sid=2&index=0

MissBleu
  • Your first line (df = ...) should work. Can you be more specific about the "garbage"? You may just need to add another parameter for the data to parse correctly. – Joe Jan 31 '18 at 18:41
  • Can you post a link to the file, some lines from it, or an example of the "garbage" that pandas gives you? – Evan Jan 31 '18 at 19:27
  • thanks, I posted my error and a link to the download – MissBleu Jan 31 '18 at 20:37

1 Answer

I fixed it thanks to the post at this question

'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte

I thought I would try the fix that they suggested

df = pd.read_csv('myfile.csv', encoding='cp1252')

and it worked! The file is encoded in Windows code page 1252 (cp1252), not UTF-8.
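For anyone hitting the same error, here is a minimal sketch of how one might check which encoding a file decodes under before handing it to pandas. The helper name first_working_encoding, the candidate list, and the sample bytes are my own assumptions for illustration, not something taken from the NIH files. A decode that succeeds does not prove the encoding is the right one; it only rules out the candidates that fail.

```python
import os
import tempfile


def first_working_encoding(path, candidates=("utf-8", "cp1252")):
    """Return the first candidate encoding that decodes the whole file.

    Hypothetical helper: the candidate list is an assumption, adjust as needed.
    """
    for enc in candidates:
        try:
            with open(path, encoding=enc, newline="") as f:
                # Read in 1 MiB chunks so a large file is never held in memory at once.
                while f.read(1 << 20):
                    pass
            return enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {candidates} could decode {path}")


# Made-up sample data: byte 0xA0 is a non-breaking space in cp1252,
# but it is not a valid start byte in UTF-8.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(b"id,title\n1,NIH\xa0grant\n")
    sample_path = tmp.name

detected = first_working_encoding(sample_path)
print(detected)
os.remove(sample_path)
```

The result could then be passed straight to pandas, for example pd.read_csv('myfile.csv', encoding=first_working_encoding('myfile.csv')).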

MissBleu