pd.read_csv not sure how to determine the encoding for my csv files

Question

I am trying to read csv files using pd.read_csv. I am running into encoding issues and I’m not sure how to proceed. The first issue I am running into is the following error message, raised when reading csv files whose names contain a µ character.

“SyntaxError: Non-UTF-8 code starting with '\xb5' in file GUI_Simpilify.py on line 4, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details”

I’m able to get past this error by manually changing the file name and removing the µ. However, this is not a real solution, as I have thousands of csv files to extract data from.

Once I manually remove the µ from a single csv file and rerun my script I get this error message: “UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 13: invalid start byte”

I believe this is due to the fact that all of my csv files contain both ± and µ characters. How can I deal with both these errors without manual solutions?

Code:

    import pandas as pd
    test_csv = pd.read_csv('OFN 0.1pg_L Split 20-1 (5 fg on column).csv')

Try: `pd.read_csv(u'OFN 0.1pg_L Split 20-1 (5 fg on column).csv')` — Anton vBR, Aug 01 '17 at 12:16
Also, have you considered looping over the directory instead of typing in the names? Look here: https://stackoverflow.com/questions/10377998/how-can-i-iterate-over-files-in-a-given-directory — Anton vBR, Aug 01 '17 at 12:17
Where do the csvs come from? Try specifying the encoding explicitly, like this: `pd.read_csv('filename.csv', encoding='utf8')`. Instead of `utf8`, you can try `cp1250` or `cp1252` for Windows-style encodings, or `latin1`, which is quite common. See here for a more complete list: https://docs.python.org/3/library/codecs.html#standard-encodings — redacted, Aug 01 '17 at 12:17
The csv files are exported from a chemical analyzer instrument called a Time of Flight Mass Spectrometer. Since many of the chemical names contain Greek characters and ranges for values, I'm going to have to deal with characters such as ± and µ. I can read the file using 'latin-1', but only after manually removing the µ from the file name. — kitestring, Aug 01 '17 at 12:43
Initially I did try looping the directory, because I have 1000's of csv files to load data from. I switched to typing the file names to simplify and limit possible error sources. Unfortunately, I'm beginning to think I'll have to write a script to change the file names and remove the µ character. Not the solution I was hoping for, but sometimes you just have to get things done and move forward. — kitestring, Aug 01 '17 at 13:29
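The suggestion above to pass `encoding` explicitly can be automated when the encoding is not known in advance. What follows is only a sketch, not a definitive method: `guess_encoding` and its default candidate list are hypothetical names chosen for illustration, and a successful decode does not prove the encoding is the right one. Latin-1 accepts every byte, so it serves as a last-resort fallback rather than a real detection.

```python
from pathlib import Path


def guess_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes the whole file.

    Hypothetical helper: the candidate order is an assumption, with the
    strictest encoding first and the most permissive last.
    """
    data = Path(path).read_bytes()
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            # This encoding cannot represent the file's bytes; try the next one.
            continue
        return name
    raise ValueError(f"none of {candidates} can decode {path}")
```

The result can then be handed to pandas, for example `pd.read_csv(path, encoding=guess_encoding(path))`.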

score 1 · Answer 1 · answered Aug 01 '17 at 07:16

This error occurs because no source encoding is declared. Add this line at the beginning of your Python script:

    # -*- coding: utf-8 -*-

Mohamed Thasin ah

I tried adding the suggested line at the beginning of my code, and it made no difference. I'm still getting the same error message. "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 13: invalid start byte" – kitestring Aug 01 '17 at 12:14
However, it did solve the first problem. Python is now ok dealing with file names that contain the µ character. Before I was getting this error when I had file names containing such characters, "SyntaxError: Non-UTF-8 code starting with '\xb5' in file GUI_Simpilify.py on line 4, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details". – kitestring Aug 01 '17 at 12:20
Does your file name contain special characters? – Mohamed Thasin ah Aug 01 '17 at 12:35
Can you try `sys.setdefaultencoding('UTF8')`? – Mohamed Thasin ah Aug 01 '17 at 12:37
I am using python 3 so I get this error: "AttributeError: module 'sys' has no attribute 'setdefaultencoding'" – kitestring Aug 01 '17 at 13:36
Did you reload sys? – Mohamed Thasin ah Aug 02 '17 at 04:49
Try this: `import sys; reload(sys); sys.setdefaultencoding("utf-8")` – Mohamed Thasin ah Aug 02 '17 at 04:52
I tried that and got this error: "NameError: name 'reload' is not defined" – kitestring Aug 02 '17 at 12:29
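The two errors in this thread come from different places: the SyntaxError is about the bytes of the script itself, while the UnicodeDecodeError is about the bytes of the csv data, which a coding declaration does not affect. As a small sketch using only the standard library (the variable names are illustrative), the two bytes named in the error messages can be checked directly:

```python
# Bytes named in the two error messages: 0xb1 and 0xb5.
raw = b"\xb1\xb5"

# Latin-1 maps each byte to the code point with the same value,
# so these decode to PLUS-MINUS SIGN and MICRO SIGN.
latin1_text = raw.decode("latin-1")

# In UTF-8 both bytes are continuation bytes, so neither may start a character.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as err:
    utf8_error = err.reason

# The same two characters, encoded as UTF-8, take two bytes each.
utf8_bytes = latin1_text.encode("utf-8")

print(utf8_error)   # invalid start byte
print(utf8_bytes)   # b'\xc2\xb1\xc2\xb5'
```

This is consistent with the data being stored in a single-byte encoding such as Latin-1 or cp1252, which is why passing that encoding to `pd.read_csv` is what resolves the second error.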

score 0 · Accepted Answer · answered Aug 01 '17 at 14:12

I was able to figure this out. It's not the most elegant solution, but it works. I wrote a function that finds all csv files in the current working directory; if a filename contains a "µ" character, it is replaced with an "_". The function returns a list of all csv file names. I understand that this could potentially create naming conflicts, but since I'm the end user I'll be careful.

    # -*- coding: Latin-1 -*-
    import os
    import pandas as pd

    # The default suffix=".csv" is an assumption; the original left it undefined.
    def find_csv_filenames_remove_nonASCII(path_to_dir, suffix=".csv"):
        filenames = os.listdir(path_to_dir)
        filenames_fixed = []
        for filename in filenames:

            # Rename files whose name contains 'µ', replacing it with '_'.
            if filename.endswith(suffix) and 'µ' in filename:
                new_filename = filename.replace('µ', '_')
                os.rename(os.path.join(path_to_dir, filename),
                    os.path.join(path_to_dir, new_filename))
                filenames_fixed.append(new_filename)

            elif filename.endswith(suffix):
                filenames_fixed.append(filename)

        # Return only after every file has been checked.
        return filenames_fixed

    csv_list_cwd = find_csv_filenames_remove_nonASCII(os.getcwd())

    for csv_file in csv_list_cwd:
        df_cwd = pd.read_csv(csv_file, encoding="Latin-1")
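An alternative sketch, assuming Python 3: the SyntaxError was triggered by a µ literal typed into the script, so taking the names from the directory listing keeps that character out of the source entirely, and renaming may not be necessary. `find_csv_files` is a hypothetical helper name, not something from the original code.

```python
from pathlib import Path


def find_csv_files(directory, suffix=".csv"):
    """Return the paths in `directory` ending in `suffix`, sorted by name.

    Nothing is renamed; names are used exactly as the file system reports them.
    """
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.name.endswith(suffix)
    )
```

Each returned path can be passed straight to pandas, for example `pd.read_csv(path, encoding="latin-1")`, since `read_csv` accepts path objects as well as strings.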
