
I am trying to parse a CSV file (from an external data source) where one of the columns uses inconsistent character encodings. Rather than trying to get the data provider to use a consistent encoding, I would like to just read that column as binary data. However, pandas.read_csv seems to decode the whole file to a string before parsing, so I get a UnicodeDecodeError. Here's a toy example (Python 3):

>>> from io import BytesIO
>>> import pandas as pd
>>> csv = b'Encoding,Data\nascii,abc\nwindows-1252,\xae\nutf-8,\xe2\x80\x9c1\xe2\x80\x9d\n'
>>> pd.read_csv(BytesIO(csv))
Traceback (most recent call last):
  File "pandas/_libs/parsers.pyx", line 1130, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1254, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas/_libs/parsers.pyx", line 1269, in pandas._libs.parsers.TextReader._string_convert
  File "pandas/_libs/parsers.pyx", line 1459, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 0: invalid start byte

I would like a result equivalent to this:

>>> df = pd.DataFrame({'Encoding': ['ascii','windows-1252','utf-8'],
...                    'Data': [b'abc',b'\xae',b'\xe2\x80\x9c1\xe2\x80\x9d']})
>>> df
       Encoding                          Data
0         ascii                        b'abc'
1  windows-1252                       b'\xae'
2         utf-8  b'\xe2\x80\x9c1\xe2\x80\x9d'

Which could (in this toy example) be further processed like this:

>>> df.apply(lambda row: str(row.Data,row.Encoding), axis=1)
0    abc
1      ®
2    “1”
dtype: object

I'd prefer solutions using only pandas, but I'm willing to look at other parsing libraries if absolutely necessary.

Quantum7

1 Answer

According to Serge Ballesta's answer to this post:

"Pandas allows to specify encoding, but does not allow to ignore errors not to automatically replace the offending bytes. So there is no one size fits all method but different ways depending on the actual use case."

  1. So first, try the latin1 encoding, because it accepts any possible byte as input and might be sufficient depending on your use case (I could run your toy example using it):

    data_frame = pd.read_csv(BytesIO(csv), encoding="latin1")
    
  2. As specified in Serge's answer: "Pandas has no provision for a special error processing, but Python open function has (assuming Python3), and read_csv accepts a file like object." For your case you might consider using 'backslashreplace', which replaces the offending bytes with Python's backslashed escape sequences (see the sketch after this list):

    file_encoding = 'utf8'  # set file_encoding to the file encoding (utf8, latin1, etc.)
    with open(path_to_csv, encoding=file_encoding, errors='backslashreplace') as my_csv:
        dataframe = pd.read_csv(my_csv)
    
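For example, on the toy data from the question, 'backslashreplace' turns the undecodable byte into literal escape text (a quick sketch, using StringIO in place of a file):

    from io import StringIO
    import pandas as pd

    csv = b'Encoding,Data\nascii,abc\nwindows-1252,\xae\nutf-8,\xe2\x80\x9c1\xe2\x80\x9d\n'

    # Decode up front, replacing any byte that is not valid UTF-8
    # with its backslashed escape sequence.
    text = csv.decode('utf8', errors='backslashreplace')
    dataframe = pd.read_csv(StringIO(text))
    print(dataframe.Data.tolist())
    # ['abc', '\\xae', '“1”']

Note that the escape sequence is now ordinary text, indistinguishable from a backslash that was in the data to begin with, which is why this operation is hard to reverse reliably.
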
Vee
  • The latin1 solution worked well for me. The original bytes can be easily recovered and the correct encoding applied (`df.apply(lambda row: row.Data.encode('latin1').decode(row.Encoding), axis=1)`). I wasn't able to figure out how to reverse the backslashreplace operation. – Quantum7 Apr 24 '20 at 14:56
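
Spelling out the round trip from the comment above on the toy example (a sketch): latin1 maps each of the 256 byte values to exactly one character, so decoding with it is lossless, and .encode('latin1') recovers the original bytes, which can then be decoded with the correct per-row encoding.

    from io import BytesIO
    import pandas as pd

    csv = b'Encoding,Data\nascii,abc\nwindows-1252,\xae\nutf-8,\xe2\x80\x9c1\xe2\x80\x9d\n'

    # Every byte decodes under latin1, so read_csv cannot fail here.
    df = pd.read_csv(BytesIO(csv), encoding='latin1')

    # Recover the raw bytes, then decode each row with its declared encoding.
    decoded = df.apply(lambda row: row.Data.encode('latin1').decode(row.Encoding),
                       axis=1)
    print(decoded)
    # 0    abc
    # 1      ®
    # 2    “1”
    # dtype: object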