UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 2: invalid start byte, tried all encoding styles

Question

ad
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 826, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 841, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 920, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 1052, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas\_libs\parsers.pyx", line 1083, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas\_libs\parsers.pyx", line 1220, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas\_libs\parsers.pyx", line 1238, in pandas._libs.parsers.TextReader._string_convert
  File "pandas\_libs\parsers.pyx", line 1429, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 2: invalid start byte

I am getting the above error while reading my CSV file.

To work around it, I used the unicode_escape encoding:

csv_df = pd.read_csv(file_path, header=0, squeeze=True, dtype=str, keep_default_na=False, encoding='unicode_escape')

However, now I get \xa0 in place of the space between two words:

'ObjectStatus': 'IN\xa0SERVICE'

My CSV has:

Key          Values
RequestID   
ObjectType   CONTAINER
ObjectName   INMUNVMBMHPBNB6001ENBCMW005
ObjectStatus IN SERVICE
ObjectType   CONTAINER

https://stackoverflow.com/questions/21504319/python-3-csv-file-giving-unicodedecodeerror-utf-8-codec-cant-decode-byte-err — USERNAME GOES HERE, Sep 18 '20 at 13:11
`\xa0` is a Unicode U+00A0 NO-BREAK SPACE. Python is displaying the string with a Unicode escape sequence so you can see it isn't a regular space. If you `print` the value, it will show as a space. — Mark Tolonen, Sep 18 '20 at 21:46
Actually, this dictionary is passed as a request to a zeep client object. There it gets converted to a question mark character. — Priyal Mangla, Sep 19 '20 at 21:48

Mark Tolonen · Accepted Answer · 2020-09-18T22:23:46.060

1

The unicode_escape codec is for text that contains literal escape sequences (the four characters \\xa0, as opposed to the single character \xa0). What you see displayed is just Python's debug representation (repr) of the string; it prints \xa0 to show that the character isn't a regular space. Your file is probably encoded in cp1252 or latin1, as the byte 0xa0 is the NO-BREAK SPACE in those encodings.

Example:

>>> d = {'ObjectStatus': 'IN\xa0SERVICE'}
>>> d
{'ObjectStatus': 'IN\xa0SERVICE'}
>>> print(d['ObjectStatus'])
IN SERVICE
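To make the encoding point concrete, here is a minimal standard-library sketch. The byte string `raw` is a hypothetical stand-in for what the file probably contains (an assumption, since the actual file bytes were not posted). It shows that the byte 0xa0 is rejected by UTF-8, decodes to NO-BREAK SPACE under cp1252 and latin-1, and that unicode_escape is meant for text with literal backslash escapes.

```python
# Hypothetical bytes, assumed to resemble the 'ObjectStatus' cell in the file.
raw = b"IN\xa0SERVICE"

# 0xa0 on its own is not valid UTF-8, so strict decoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# In cp1252 and latin-1 the byte 0xa0 maps to U+00A0 NO-BREAK SPACE.
as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin-1")

# unicode_escape targets literal escapes: here the data really contains
# the four characters backslash, 'x', 'a', '0' (13 bytes in total).
literal = rb"IN\xa0SERVICE"
unescaped = literal.decode("unicode_escape")

print(utf8_ok)               # whether strict UTF-8 decoding succeeded
print(repr(as_cp1252))       # repr shows the escape, print() shows a space
print(len(literal), len(unescaped))
```

If the file really is cp1252, passing `encoding='cp1252'` (or `'latin1'`) to `pd.read_csv` instead of `'unicode_escape'` would be the matching choice.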

edited Sep 18 '20 at 22:23

answered Sep 18 '20 at 21:50

Mark Tolonen



score 0 · Answer 2 · answered May 31 '21 at 11:23

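The following worked for me. Using str.replace(), I replaced every '\xa0' in the column with a regular space ' ':

# Read every column as str and keep empty cells as '' instead of NaN
csv_df = pd.read_csv(file_path, header=0, squeeze=True, dtype=str, keep_default_na=False)

# Replace each NO-BREAK SPACE (U+00A0) in the 'Values' column with a regular space
csv_df['Values'] = csv_df['Values'].astype(str).str.replace('\xa0', ' ')

I had to pass these values to another function that builds an XML document. I tried every encoding I could think of and none of them worked, so I replaced the character instead.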
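The same replacement can be sketched without pandas, for the case where the values already sit in a plain dict before being handed on. The helper `replace_nbsp` and the `request` dict are hypothetical names for illustration. The last line shows one possible (unconfirmed) explanation for the question mark mentioned in the comments: encoding U+00A0 to a charset that lacks it, with `errors="replace"`, substitutes `?`.

```python
NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE


def replace_nbsp(record):
    """Return a copy of record with NBSP swapped for a plain space in str values."""
    return {
        key: value.replace(NBSP, " ") if isinstance(value, str) else value
        for key, value in record.items()
    }


# Hypothetical request dict, modelled on the values shown in the question.
request = {"ObjectStatus": "IN\xa0SERVICE", "ObjectType": "CONTAINER"}
cleaned = replace_nbsp(request)

# Assumption: something downstream encodes to a charset without U+00A0.
# With errors="replace", the unencodable character becomes '?'.
lossy = request["ObjectStatus"].encode("ascii", errors="replace")

print(cleaned)
print(lossy)
```

Cleaning the values before they reach the downstream call avoids depending on how that call encodes its payload.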
