
I recently started using pandas to read a DataFrame from a CSV file. When reading the CSV file I also have to pass 'utf-8' to avoid the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 19: invalid start byte

So I modified the code like this:

df = pd.read_csv('file.csv', 'utf-8')

This gets rid of the error. Is there a way to avoid this error without including 'utf-8'?

I also have to include skipinitialspace=True in my code, and I'm not sure how to add it to the call. The following code gives me an error:

df = pd.read_csv('file.csv', skipinitialspace=True, usecols=fields, 'utf-8')
  • I never worked with pandas, but according to the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) it should be: `df = pd.read_csv('file.csv', skipinitialspace=True, usecols=fields, encoding='utf-8')` – Matthias Jan 18 '20 at 11:05
  • do you know the encoding of the file though? Is it really `utf-8` ? – anky Jan 18 '20 at 11:10
  • I just used this [Here](https://stackoverflow.com/questions/37177069/how-to-check-encoding-of-a-csv-file) and it showed me that the file is ANSI –  Jan 18 '20 at 11:14
  • can you try using the encoding which you found. Also it is better to pass the encoding as @Matthias suggested to avoid [this](https://stackoverflow.com/questions/42163846/positional-argument-follows-keyword-argument) – anky Jan 18 '20 at 11:18
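
For context on the error quoted in the question: byte 0x84 is a UTF-8 continuation byte, so it cannot start a character. In Windows-1252, which is what Windows tools usually mean by "ANSI" on Western-locale systems (an assumption about this particular file), the same byte is a low double quotation mark. A minimal sketch with a made-up byte string, using only the standard library:

```python
# Hypothetical sample bytes; 0x84 stands in for the byte named in the traceback.
raw = b"name,quote\nAnna,\x84hello"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc  # keep a reference; `exc` is cleared when the except block ends
else:
    error = None

# cp1252 (Windows-1252) maps 0x84 to U+201E, the double low-9 quotation mark.
decoded = raw.decode("cp1252")

print(error)
print(decoded)
```

If the file really is Windows-1252, the corresponding pandas call would pass the encoding by keyword, for example `pd.read_csv('file.csv', encoding='cp1252', skipinitialspace=True, usecols=fields)`.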

1 Answer

I had a similar problem when working with lots of different txt files. I developed a small program to parse 50 lines of my file and detect the encoding using chardet.

It's better to use more lines, as suggested by anky_91, so use 1000 or more.

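from pathlib import Path
import chardet

# 'your_path' is a placeholder for the folder that holds the files
files = list(Path(r'your_path').glob('*.csv'))  # change ext as you wish

encodings = {}
for file in files:
    # Open in binary mode so no decoding happens before detection
    with open(file, 'rb') as f:
        data = f.read(1000)  # reads 1000 bytes; a larger sample gives a better guess
    encoding = chardet.detect(data).get("encoding")
    encodings[f'{file}'] = encoding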

This will give you a dictionary of file paths and encodings that you can pass to read_csv through its encoding argument.
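
A minimal sketch of that last step, assuming `encodings` maps each file path string to the detected encoding name as built above. The `load_all` helper and its `fallback` parameter are illustrative names, not part of the original answer. chardet can return `None` when it cannot make a guess, so the sketch falls back to a default encoding in that case.

```python
import pandas as pd


def load_all(encodings, fallback="utf-8", **read_csv_kwargs):
    """Read every file listed in `encodings` using its detected encoding.

    `encodings` maps file path -> encoding name (or None if detection failed).
    Extra keyword arguments are forwarded to pd.read_csv unchanged.
    """
    frames = {}
    for path, encoding in encodings.items():
        # Pass the encoding by keyword; use the fallback when detection gave None.
        frames[path] = pd.read_csv(
            path, encoding=encoding or fallback, **read_csv_kwargs
        )
    return frames
```

With the names from the question, a call could look like `dfs = load_all(encodings, skipinitialspace=True, usecols=fields)`, which returns one DataFrame per file path.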

Umar.H
  • Nice, chardet performs better with more data. Hence if possible, make it read more lines :) – anky Jan 18 '20 at 11:37