String in pandas is not printing correctly

Question

I'm using pandas to load a CSV file containing Twitter messages:

corpus = pd.read_csv(data_path, encoding='utf-8')

Here is an example of the data:

label,date,comment
0,20120528192215Z,"""i really don't understand your point.\xa0 It seems that you are mixing apples and oranges."""

When I try to print the comment I get:

print(corpus.iloc[1]['comment'])
>> "i really don't understand your point.\xa0 It seems that you are mixing apples and oranges."

The \xa0 is still in the output. But if I paste the string from the file into Python and print it, I get the correct output:

print("""i really don't understand your point.\xa0 It seems that you are mixing apples and oranges.""")
>> i really don't understand your point.  It seems that you are mixing apples and oranges.

I would like to know why the two outputs differ and whether there is a way to get the string in pandas to print correctly. I would prefer a better solution than just calling replace, since the data contains many other Unicode escapes such as \xe1, \u0111, \u01b0, \u1edd, etc.

Possible duplicate of [Python: Removing \xa0 from string?](https://stackoverflow.com/questions/10993612/python-removing-xa0-from-string) — anky, Mar 11 '19 at 13:32
You want to remove all unicode characters from your column.? — Sreeram TP, Mar 11 '19 at 13:42

Rich Andrews · Answer 1 · 2019-03-11T17:28:57.450

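The input file that pandas loads must be plain ASCII: the \xa0 in it is stored as the four literal characters \, x, a, 0, not as the UTF-8 bytes of a non-breaking space. If the file contained real UTF-8, the UTF-8 decoder would turn those bytes into the proper character. Because ASCII is also valid UTF-8, pandas still loads the file without error, but the escaped \xa0 is loaded literally and is not translated into the Unicode non-breaking space.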

It works when you copy/paste because Python sees the \xa0 inside a string literal and interprets it as an escape sequence when the source code is parsed.
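A minimal sketch of that difference, using a shortened version of the comment (the variable names are only illustrative). A raw string stands in for what read_csv receives from the ASCII file:

```python
# A normal string literal: Python turns the escape into ONE character (U+00A0).
from_literal = "point.\xa0 It"

# A raw string mimics the file contents: FOUR separate characters \ x a 0.
from_file = r"point.\xa0 It"

print(len(from_literal), repr(from_literal))
print(len(from_file), repr(from_file))
```

The two strings have different lengths, and only the first one contains an actual non-breaking space.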

import pandas as pd
data = {u"label": 0, u"date": u"20120528192215Z", u"comment": u"\"i really don't understand your point.\xa0 It seems that you are mixing apples and oranges.\""}
df = pd.DataFrame(index=[1], data=data)
df.to_csv("/tmp/corpusutf8.csv", index=False, encoding="utf-8")

pd.read_csv("/tmp/corpusutf8.csv")
                                             comment             date  label
0  "i really don't understand your point.  It see...  20120528192215Z      0
df['comment']
1    "i really don't understand your point.  It see...
Name: comment, dtype: object

file /tmp/corpusutf8.csv
/tmp/corpusutf8.csv: UTF-8 Unicode text

If the CSV instead contains \xa0 as literal ASCII text, pandas loads those characters as-is, even though a UTF-8 encoding is specified:

cat /tmp/corpusascii.csv
label,date,comment
0,20120528192215Z,"""i really don't understand your point.\xa0 It seems that you are mixing apples and oranges."""
file !$
file /tmp/corpusascii.csv
/tmp/corpusascii.csv: ASCII text

df1 = pd.read_csv("/tmp/corpusascii.csv", encoding="utf-8")
df1
   label             date                                            comment
0      0  20120528192215Z  "i really don't understand your point.\xa0 It ...
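If the file really does contain literal escapes, one possible way to convert them after loading is a regular-expression substitution. This is only a sketch: decode_literal_escapes is a hypothetical helper, not a pandas function, and it assumes the only escapes present are of the \xNN and \uNNNN forms mentioned in the question.

```python
import re

# Matches literal escapes written out as text, e.g. \xa0 or \u0111.
# Assumption: only \xNN and \uNNNN forms occur in the data.
_ESCAPE_RE = re.compile(r"\\x([0-9a-fA-F]{2})|\\u([0-9a-fA-F]{4})")

def decode_literal_escapes(text):
    """Replace literal \\xNN and \\uNNNN sequences with the characters they name."""
    def _sub(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    return _ESCAPE_RE.sub(_sub, text)

# Raw string standing in for a value loaded from the ASCII file.
raw = r"your point.\xa0 It seems \u0111"
decoded = decode_literal_escapes(raw)
print(repr(decoded))

# With the DataFrame from above, this could be applied per row:
# df1["comment"] = df1["comment"].map(decode_literal_escapes)
```

Unlike decoding with the unicode_escape codec, this leaves any real non-ASCII characters already in the text untouched, which matters if the file mixes both forms.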
