6

I am trying to read a CSV file into a pandas dataframe. However, the CSV contains accented characters. I am using Python 2.7.

I've run into a UnicodeDecodeError because there is an accent in the first column. I've read up on a bunch of sites like this SO question about UTF-8 in CSV files, this blog post on CSV errors related to newlines, and this blog post on UTF-8 issues in Python 2.7.

I used answers I've found from there to try to modify my code. Originally I had:

import pandas as pd

#Create a dataframe with the data we are interested in
df = pd.DataFrame.from_csv('MYDATA.csv')
mode = lambda ts: ts.value_counts(sort=True).index[0]
cols = df['CompanyName'].value_counts().index
df['Calls'] = df.groupby('CompanyName')['CompanyName'].transform(pd.Series.value_counts)

Et cetera. It worked, but now passing in "NÍ" and "Nê" as a customer name gives the error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 7: invalid continuation byte

I tried changing the line to df = pd.read_csv('MYDATA.csv', encoding='utf-8'), but this gives the same error.

So I tried the following, based on suggestions I found while researching, but it does not work either and I get the same error.

import pandas as pd
import csv

def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]


reader = unicode_csv_reader(open('MYDATA.csv','rU'), dialect = csv.reader)
#Create a dataframe with the data we are interested in
df =pd.DataFrame(reader)

I feel like it should not be this difficult to read csv data into a pandas dataframe. Does anyone know of an easier way?

Edit: What is really strange is that if I delete the row with the accented characters I still get the error

UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 960: invalid continuation byte.

This is strange as my test csv has 19 rows and 27 columns. But I hope that if I decode utf8 for the entire csv it will fix the problem.

Paul
jenryb
  • Please don't use `from_csv`; it's not updated anymore. Use the top-level `read_csv` instead. Please try this: `df = pd.read_csv('MYDATA.csv', encoding='utf-8')` – EdChum Jun 19 '15 at 19:21
  • Yes, I tried this as well. If my line is df = pd.DataFrame.read_csv('testing2.csv', encoding='utf-8'), I get the error "AttributeError: type object 'DataFrame' has no attribute 'read_csv'". Otherwise I get the same UnicodeDecodeError if there are two lines: ra = pd.read_csv('testing2.csv', encoding='utf-8') // df = Dataframe(ra) – jenryb Jun 19 '15 at 19:26
  • Well, the error is correct: there is no `read_csv` attribute for a `DataFrame`. If you read my code carefully, it shows `pd.read_csv`, so: `import pandas as pd`, then `df = pd.read_csv('MYDATA.csv', encoding='utf-8')` – EdChum Jun 19 '15 at 19:37
  • Yes, I tried that. It gives me the same UnicodeDecodeError using df = pd.read_csv('MYDATA.csv', encoding='utf-8') – jenryb Jun 19 '15 at 19:48
  • The point is, is your csv file encoded in `utf-8`? See [here](https://docs.python.org/2/library/csv.html) for universal encoder/decoder in python 2.7, but you need to provide the right encoding for the file. Hope it helps. – lrnzcig Jun 21 '15 at 15:58
  • Maybe this isn't an option, but do you need to use Python 2.7? If your code is mostly compliant across the board, it may be easier to bite the bullet once and get it working for Python 3.x, which handles Unicode much more cleanly. – Paul Apr 14 '16 at 01:30
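
The comments above come down to one question: which encoding is the file actually in? The byte 0xea in the error message is not valid UTF-8 in that position, but it is "ê" in cp1252 and Latin-1. Below is a minimal sketch of how one might check. It uses a hypothetical helper `guess_encoding` and made-up sample bytes, not the real MYDATA.csv, and the candidate list is an assumption. A successful decode only shows that an encoding is possible, not that it is the right one (Latin-1 in particular never fails).

```python
def guess_encoding(raw_bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes raw_bytes without error.

    Hypothetical helper for illustration; returns None if no candidate fits.
    """
    for name in candidates:
        try:
            raw_bytes.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes, try the next
        return name
    return None


# Made-up sample standing in for the real file: 0xea is "ê" in cp1252 and
# Latin-1, but it is not valid UTF-8 when followed by an ASCII comma.
sample = b"CompanyName,Calls\nN\xea,3\n"
print(guess_encoding(sample))  # prints cp1252
```

For a real file, the bytes could be read with `open('MYDATA.csv', 'rb').read()`, and the resulting name passed on as the `encoding` argument, for example `pd.read_csv('MYDATA.csv', encoding='cp1252')`.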

2 Answers

1

Try adding this to the top of your script:

import sys  
reload(sys) 
sys.setdefaultencoding('utf8')
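
As an alternative to changing the interpreter-wide default, the file can be decoded explicitly where it is opened. This is a rough sketch only: `read_rows` is a hypothetical helper, cp1252 is an assumed encoding (the question never states the real one), and the naive comma split ignores quoted fields.

```python
import io


def read_rows(path, encoding="cp1252"):
    """Return the rows of a simple comma-separated file as lists of text cells.

    io.open decodes while reading on both Python 2.7 and Python 3, so every
    cell is already Unicode text. Quoted fields are not handled here.
    """
    with io.open(path, "r", encoding=encoding, newline="") as handle:
        return [line.rstrip("\r\n").split(",") for line in handle]
```

Because the decoding happens in one explicit place, a wrong encoding guess shows up there instead of somewhere later in the script.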
GNMO11
-1

I know it is very annoying to run into an error in read_csv. You can try df = pd.read_csv(filename, sep='', error_bad_lines=False). This skips the bad lines, which can save a lot of time.

ye jiawei