0

I am trying to read CSV files using pd.read_csv. I am running into encoding issues and I’m not sure how to proceed. The first issue I am running into is the following error message, raised when reading CSV files whose names contain a µ character.

“SyntaxError: Non-UTF-8 code starting with '\xb5' in file GUI_Simpilify.py on line 4, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details”

I’m able to get past this error by manually changing the file name and removing the µ. However, this is not a solution, as I have thousands of CSV files to extract data from.

Once I manually remove the µ from a single csv file and rerun my script I get this error message: “UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 13: invalid start byte”

I believe this is due to the fact that all of my csv files contain both ± and µ characters. How can I deal with both these errors without manual solutions?

Code:

    import pandas as pd
    test_csv = pd.read_csv('OFN 0.1pg_L Split 20-1 (5 fg on column).csv')
Bonifacio2
kitestring
  • Try: `pd.read_csv(u'OFN 0.1pg_L Split 20-1 (5 fg on column).csv')` – Anton vBR Aug 01 '17 at 12:16
  • Also, have you considered not typing in names and loop the directory. Look here: https://stackoverflow.com/questions/10377998/how-can-i-iterate-over-files-in-a-given-directory – Anton vBR Aug 01 '17 at 12:17
  • Where do the CSVs come from? Try specifying the encoding explicitly, like this: `pd.read_csv('filename.csv', encoding='utf8')`. Instead of `utf8`, you can try `cp1250` or `cp1252` for Windows-style encodings, or `latin1`, which is quite common. Refer here for a more complete list: https://docs.python.org/3/library/codecs.html#standard-encodings – redacted Aug 01 '17 at 12:17
  • The CSV files are exported from a chemical analyzer instrument called a Time of Flight Mass Spectrometer. Since many of the chemical names contain Greek characters and ranges for values, I'm going to have to deal with characters such as ± and µ. I can read the file using 'latin-1', but only after manually removing the µ from the file name. – kitestring Aug 01 '17 at 12:43
  • Initially I did try looping the directory, because I have 1000's of csv files to load data from. I switched to typing the file names to simplify and limit possible error sources. Unfortunately, I'm beginning to think I'll have to write a script to change the file names and remove the µ character. Not the solution I was hoping for, but sometimes you just have to get things done and move forward. – kitestring Aug 01 '17 at 13:29
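
The encoding suggestions in the comments above can be sketched as a small helper that tries several codecs in turn. This is only an illustration: the function name `pick_encoding` and the candidate order are assumptions, not something from the thread. It uses only the standard library so it can be run without pandas.

```python
def pick_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes the whole file.

    Hypothetical helper: the name and the default candidate list are
    illustrative choices, not part of pandas or the original thread.
    """
    with open(path, "rb") as handle:
        raw = handle.read()
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this codec cannot represent the bytes; try the next one
        return name
    raise ValueError("none of the candidate encodings fit " + str(path))
```

The result could then be passed on as `pd.read_csv(path, encoding=pick_encoding(path))`. Note that `latin-1` accepts every possible byte, so it is placed last as a catch-all, and a successful decode only shows that the bytes are valid in that codec, not that the characters are the intended ones.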

2 Answers

1

This error occurs because the script does not declare its source encoding. Add this line at the beginning of your Python script:

    # -*- coding: utf-8 -*-
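
As a minimal sketch (not part of the original answer), the snippet below shows how the two bytes named in the question's error messages behave under different codecs. It assumes the script and the CSV data were saved in Latin-1 or cp1252, which is a guess based on the `\xb5` and `0xb1` bytes in the tracebacks. The characters are written as escapes so the snippet itself stays pure ASCII.

```python
# The SyntaxError names byte 0xb5; the UnicodeDecodeError names byte 0xb1.
micro_byte = b"\xb5"
plus_minus_byte = b"\xb1"

# Under Latin-1 (and cp1252) each of these bytes maps to exactly one character.
micro = micro_byte.decode("latin-1")            # the micro sign, U+00B5
plus_minus = plus_minus_byte.decode("latin-1")  # the plus-minus sign, U+00B1


def decodes_as_utf8(data):
    """Return True if `data` is valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# In UTF-8 a lone 0xb5 or 0xb1 is a continuation byte with no lead byte,
# which is why Python reports "invalid start byte".
lone_bytes_are_utf8 = decodes_as_utf8(micro_byte) or decodes_as_utf8(plus_minus_byte)

# UTF-8 stores the micro sign as two bytes instead.
micro_as_utf8 = "\u00b5".encode("utf-8")
```

The practical consequence is that the `coding` line has to name the encoding the file was actually saved in. If the editor saved the script as Latin-1, a `utf-8` declaration alone would not remove the error; re-saving the script as UTF-8 or declaring `latin-1` would.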
Mohamed Thasin ah
0

I was able to figure this out. It's not the most elegant solution, but it works. I wrote a function that finds all CSV files in the current working directory; if a file name contains a "µ" character, the file is renamed with "_" in its place. The function returns a list of all CSV file names. I understand that this could potentially create naming conflicts, but since I'm the end user I'll be careful.

    # -*- coding: Latin-1 -*-
    import os
    import pandas as pd

    # The def line was missing from the post; the ".csv" default is assumed.
    def find_csv_filenames_remove_nonASCII(path_to_dir, suffix=".csv"):
        filenames = os.listdir(path_to_dir)
        filenames_fixed = []
        for filename in filenames:
            # Rename any CSV whose name contains 'µ', replacing it with '_'.
            if filename.endswith(suffix) and 'µ' in filename:
                new_filename = filename.replace('µ', '_')
                os.rename(os.path.join(path_to_dir, filename),
                          os.path.join(path_to_dir, new_filename))
                filenames_fixed.append(new_filename)
            elif filename.endswith(suffix):
                filenames_fixed.append(filename)
        # Return only after every file has been checked.
        return filenames_fixed

    csv_list_cwd = find_csv_filenames_remove_nonASCII(os.getcwd())

    for csv_file in csv_list_cwd:
        df_cwd = pd.read_csv(csv_file, encoding="Latin-1")
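
A possible alternative that avoids renaming is sketched below. It rests on an assumption: the SyntaxError in the question points at the script file itself, so the trouble may come from the µ typed into the source rather than from the names on disk. If that holds, letting Python list the directory keeps the character out of the script entirely. The names `list_csv_files`, `has_micro_sign`, and `MICRO_SIGN` are hypothetical and chosen for illustration.

```python
from pathlib import Path

# Written as an escape so the script itself stays pure ASCII.
MICRO_SIGN = "\u00b5"


def list_csv_files(directory):
    """Return the .csv paths in `directory`, sorted, without renaming anything."""
    return sorted(Path(directory).glob("*.csv"))


def has_micro_sign(path):
    """Return True if the file name (not the parent folders) contains the micro sign."""
    return MICRO_SIGN in Path(path).name
```

Each path from `list_csv_files(".")` could then be handed to `pd.read_csv(path, encoding="latin-1")`, and `has_micro_sign` is only needed if the affected files should be reported or treated separately.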
kitestring