3

I have some folders at the location : log_files_path. All these folders contain CSVs with different names. My aim is to read all these csvs from all the folders present at log_files_path and collate them into a single dataframe. I wrote the following code :

from os import listdir
from os.path import isfile, join

import pandas as pd

def read_all_logs(log_files_path):
    frames = []
    for region in listdir(log_files_path):
        region_log_filepath = join(log_files_path, region)
        # files stores file paths
        files = [join(region_log_filepath, file) for file in listdir(region_log_filepath) if isfile(join(region_log_filepath, file))]

        # collect the data from every file, then collate into a single DataFrame
        for file in files:
            frames.append(pd.read_csv(file, encoding='utf-8'))
    return pd.concat(frames, ignore_index=True)

This gives me an error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 61033: invalid start byte. On opening the CSVs, I found that some columns have garbled values such as ƒÂ‚‚ÃÂÂÂ.

I want to ignore such characters altogether. How can I do it?

Soumya Pandey
  • 321
  • 3
  • 19
  • There is nothing like a universal or catch-all encoding. You should try to guess the encoding either *by hand* or with the [chardet](https://github.com/chardet/chardet) module. Only if you want to ignore any encoding problems can you go with the `Latin1` encoding, which will accept any possible input but will return garbage if the file uses a different encoding. – Serge Ballesta Feb 01 '22 at 13:50
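The fallback strategy from this comment can be sketched in plain Python (the helper name and the encoding order are illustrative, not from the thread): try UTF-8 first, and only fall back to Latin-1 last, since Latin-1 never fails but may silently produce mojibake.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "cp1252", "latin-1")) -> str:
    """Try each encoding in order and return the first successful decode.

    latin-1 comes last: it maps every possible byte, so it always succeeds,
    but it returns garbage if the file was actually written in another encoding.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")

print(decode_with_fallback(b"caf\xc3\xa9"))  # valid UTF-8: 'café'
print(decode_with_fallback(b"caf\xa0"))      # invalid UTF-8, falls back to cp1252
```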

3 Answers

3

You can pass encoding_errors='ignore', but I would advise trying a different encoding first.
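Applied to the question's setup, a minimal sketch (requires pandas 1.3+; the sample file name and its contents are invented for illustration, reusing the 0xa0 byte from the traceback):

```python
import pandas as pd

# Write a small CSV containing a byte that is invalid UTF-8 (0xa0),
# mimicking the error in the question.
with open("sample.csv", "wb") as f:
    f.write(b"name,value\nfoo\xa0bar,1\n")

# encoding_errors='ignore' simply drops the undecodable byte (pandas >= 1.3)
df = pd.read_csv("sample.csv", encoding="utf-8", encoding_errors="ignore")
print(df.iloc[0, 0])  # the 0xa0 byte is gone: 'foobar'
```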

buran
  • 13,682
  • 10
  • 36
  • 61
0

tl;dr

For your situation, you probably need one of these:

pd.read_csv(file, encoding='utf-8', encoding_errors='replace')
# or
pd.read_csv(file, encoding='utf-8', encoding_errors='ignore')

Longer answer

As of pandas 1.3, pandas.read_csv() accepts the encoding_errors argument.

Possible values:

  • strict: Raise UnicodeError (or a subclass); this is the default.
  • ignore: Ignore the malformed data and continue without further notice.
  • replace: Replace with a suitable replacement marker; Python will use the official U+FFFD REPLACEMENT CHARACTER for the built-in codecs on decoding, and ‘?’ on encoding.
  • xmlcharrefreplace: Replace with the appropriate XML character reference (only for encoding).
  • backslashreplace: Replace with backslashed escape sequences.
  • namereplace: Replace with \N{...} escape sequences (only for encoding).
  • surrogateescape: On decoding, replace byte with individual surrogate code ranging from U+DC80 to U+DCFF. This code will then be turned back into the same byte when the 'surrogateescape' error handler is used when encoding the data.
  • surrogatepass: Allow encoding and decoding of surrogate codes. These codecs normally treat the presence of surrogates as an error.
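The decoding-side handlers above can be tried directly with bytes.decode, independent of pandas, again using the 0xa0 byte from the question's traceback:

```python
raw = b"caf\xa0"  # 0xa0 is not a valid UTF-8 start byte

print(raw.decode("utf-8", errors="replace"))           # 'caf\ufffd' (caf�)
print(raw.decode("utf-8", errors="ignore"))            # 'caf'
print(raw.decode("utf-8", errors="backslashreplace"))  # 'caf\xa0' with a literal backslash

# surrogateescape round-trips: the lone byte survives a decode/encode cycle
s = raw.decode("utf-8", errors="surrogateescape")      # 'caf\udca0'
assert s.encode("utf-8", errors="surrogateescape") == raw
```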
K.Mat
  • 1,341
  • 11
  • 17
0

Borrowing from @mikey's answer to UnicodeDecodeError when reading CSV file in Pandas, you can detect the encoding of the csv file before pandas reads it:

from pathlib import Path

import chardet
import pandas as pd

filename = "file_name.csv"
detected = chardet.detect(Path(filename).read_bytes())
# detected is something like {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

encoding = detected.get("encoding")
assert encoding, "Unable to detect encoding, is it a binary file?"

df = pd.read_csv(filename, encoding=encoding)
Mark K
  • 8,767
  • 14
  • 58
  • 118