
I am trying to get rid of strings like \xa0, \xc2, etc. I know it is an encoding problem, but how do I fix it? Neither the utf-8 nor the "ISO-8859-1" encoding option worked for me:

train = pd.read_csv('./data/train.csv',index_col = False,low_memory = False,encoding='utf-8')

test = pd.read_csv('./data/test.csv',index_col = False,low_memory = False,encoding="ISO-8859-1")

This is the output after running:

train = pd.DataFrame(data = train)
print(train)
   Insult  Date             Comment
1  0       20120528192215Z  "i really don't understand your point.\xa0 It ...
2  0       NaN              "A\\xc2\\xa0majority of Canadians can and has ...
3  0       NaN              "listen if you dont wanna get married to a man...
4  0       20120619094753Z  "C\xe1c b\u1ea1n xu\u1ed1ng \u0111\u01b0\u1edd...
TuMama

1 Answer


You can replace the unwanted characters with str.replace, like this:

string_cleaned = "string_containing_unicode_or_latin".replace(u'\xa0', u' ')
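If there are many such characters, a regex can strip them all at once instead of replacing each one individually. A minimal sketch, assuming the column is named "Comment" as in the sample output above:

```python
import re

def clean_text(text):
    # \xa0 is a non-breaking space, so map it to a normal space first,
    # then drop any remaining non-ASCII characters with a regex.
    text = text.replace(u'\xa0', u' ')
    return re.sub(r'[^\x00-\x7F]+', '', text)

# Applied to a DataFrame column (assuming pandas is imported as pd):
# train['Comment'] = train['Comment'].astype(str).apply(clean_text)
```

Note that stripping non-ASCII this way also deletes legitimate accented characters, so it is a blunt instrument.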

For more info: https://docs.python.org/3/howto/unicode.html

Another recommended approach is unicodedata.normalize.
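A sketch of that approach: NFKD normalization decomposes characters into base letters plus combining marks (and turns compatibility characters like the non-breaking space into their plain equivalents), after which encoding to ASCII with errors='ignore' discards anything without an ASCII form. The function name is just illustrative:

```python
import unicodedata

def normalize_text(text):
    # NFKD splits accented characters into base letter + combining mark;
    # encoding to ASCII with errors='ignore' then drops the marks and any
    # other character that has no ASCII equivalent.
    return (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('ascii'))
```

Unlike the regex above, this keeps the base letters of accented characters (é becomes e) rather than deleting them outright.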

Hope it helps

Nandan Pandey