I'm using pandas to load a CSV file containing Twitter messages:

corpus = pd.read_csv(data_path, encoding='utf-8')

Here is an example of the data

label,date,comment
0,20120528192215Z,"""i really don't understand your point.\xa0 It seems that you are mixing apples and oranges."""

When I try to print the comment I get:

print(corpus.iloc[1]['comment'])
>> "i really don't understand your point.\xa0 It seems that you are mixing apples and oranges."

The \xa0 is still in the output. But if I paste the string from the file into a literal and print it, I get the correct output:

print("""i really don't understand your point.\xa0 It seems that you are mixing apples and oranges.""")
>> i really don't understand your point.  It seems that you are mixing apples and oranges.

I would like to know why the two outputs differ, and whether there is a way to get the string in pandas to print correctly. I would prefer a better solution than a plain replace, since the data contains many other Unicode escapes such as \xe1, \u0111, \u01b0, \u1edd, etc.

rpanai
o1-steve
  • Possible duplicate of [Python: Removing \xa0 from string?](https://stackoverflow.com/questions/10993612/python-removing-xa0-from-string) – anky Mar 11 '19 at 13:32
  • Do you want to remove all Unicode characters from your column? – Sreeram TP Mar 11 '19 at 13:42

1 Answer

The file that pandas is loading is evidently plain ASCII, with \xa0 stored as four literal characters (backslash, x, a, 0). If the file held a real UTF-8 encoded non-breaking space, the UTF-8 decoder would load it properly. Because ASCII is a subset of UTF-8, pandas still reads the file without error when encoding='utf-8' is given, but the escaped \xa0 is loaded literally and is not translated into the Unicode non-breaking space.

The reason it works when you copy/paste is that Python sees an escape sequence in a string literal and interprets it when the source code is parsed.
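To make the difference concrete, here is a minimal sketch comparing a string literal with the same characters as they would arrive from the file. The names `from_literal` and `from_file` are only illustrative, and the raw string stands in for text read from the CSV.

```python
# The parser interprets \xa0 inside an ordinary string literal,
# so the literal holds ONE character, U+00A0 (non-breaking space).
from_literal = "point.\xa0 It"

# A raw string mimics what is stored in the ASCII file:
# four separate characters, backslash, x, a, 0.
from_file = r"point.\xa0 It"

print(len("\xa0"), len(r"\xa0"))  # prints: 1 4
print(repr(from_literal))
print(repr(from_file))
```

Text read from a file never passes through the Python parser, which is why the escape survives as plain characters in the DataFrame.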

import pandas as pd
data = {u"label": 0, u"date": u"20120528192215Z", u"comment": u"\"i really don't understand your point.\xa0 It seems that you are mixing apples and oranges.\""}
df = pd.DataFrame(index=[1], data=data)
df.to_csv("/tmp/corpusutf8.csv", index=False, encoding="utf-8")
pd.read_csv("/tmp/corpusutf8.csv")
                                             comment             date  label
0  "i really don't understand your point.  It see...  20120528192215Z      0
df['comment']
1    "i really don't understand your point.  It see...
Name: comment, dtype: object

file /tmp/corpusutf8.csv
/tmp/corpusutf8.csv: UTF-8 Unicode text

If instead the CSV is plain ASCII and contains the literal characters \xa0, pandas loads them unchanged even though a utf-8 encoding is specified.

cat /tmp/corpusascii.csv
label,date,comment
0,20120528192215Z,"""i really don't understand your point.\xa0 It seems that you are mixing apples and oranges."""
file !$
file /tmp/corpusascii.csv
/tmp/corpusascii.csv: ASCII text
df1 = pd.read_csv("/tmp/corpusascii.csv", encoding="utf-8")
df1
   label             date                                            comment
0      0  20120528192215Z  "i really don't understand your point.\xa0 It ...
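If the escapes really are literal text in the file, one possible way to convert them after loading is Python's built-in `unicode_escape` codec. The following is a sketch under that assumption, not a tested recipe for the real data: `decode_escapes` is a hypothetical helper name, and the in-memory CSV is a shortened stand-in for the file from the question.

```python
import io

import pandas as pd


def decode_escapes(text):
    # Hypothetical helper: turn literal \xNN and \uNNNN sequences
    # into the characters they name.
    if not isinstance(text, str):
        return text  # leave NaN and other non-string values untouched
    # backslashreplace rewrites characters outside Latin-1 as \uNNNN escapes,
    # which unicode_escape then decodes together with the literal ones.
    return text.encode("latin-1", "backslashreplace").decode("unicode_escape")


# Shortened stand-in for the ASCII file (assumption: same layout as the question).
raw_csv = (
    "label,date,comment\n"
    '0,20120528192215Z,"""i really don\'t understand your point.\\xa0 It seems"""\n'
)
corpus = pd.read_csv(io.StringIO(raw_csv))
corpus["comment"] = corpus["comment"].map(decode_escapes)
print(corpus.loc[0, "comment"])
```

Two caveats apply to this approach: the codec also converts other backslash sequences such as a literal \n or \t, and a malformed escape (for example \x followed by non-hex characters) raises UnicodeDecodeError, so it is only appropriate when every backslash in the column is meant as an escape.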
Rich Andrews