
I am trying to get rid of strings like \xa0, \xc2, etc. I know it is an encoding problem, but how do I fix it? Neither the utf-8 nor the "ISO-8859-1" encoding option worked for me:

train = pd.read_csv('./data/train.csv',index_col = False,low_memory = False,encoding='utf-8')

test = pd.read_csv('./data/test.csv',index_col = False,low_memory = False,encoding="ISO-8859-1")

This is the output after running:

train = pd.DataFrame(data = train)
print(train)
   Insult  Date             Comment
1  0       20120528192215Z  "i really don't understand your point.\xa0 It ...
2  0       NaN              "A\\xc2\\xa0majority of Canadians can and has ...
3  0       NaN              "listen if you dont wanna get married to a man...
4  0       20120619094753Z  "C\xe1c b\u1ea1n xu\u1ed1ng \u0111\u01b0\u1edd...
TuMama

1 Answer


You can replace the unwanted characters with str.replace, like this:

string_cleaned = "string_containing_unicode_or_latin".replace(u'\xa0', u' ')
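If there are many such characters, a regex can strip them all at once instead of replacing each one individually. A minimal sketch, assuming the column is named "Comment" as in the sample output above:

```python
import re

def clean_text(text):
    # \xa0 is a non-breaking space, so map it to a normal space first,
    # then drop any remaining non-ASCII characters with a regex.
    text = text.replace(u'\xa0', u' ')
    return re.sub(r'[^\x00-\x7F]+', '', text)

# Applied to a DataFrame column (assuming pandas is imported as pd):
# train['Comment'] = train['Comment'].astype(str).apply(clean_text)
```

Note that stripping non-ASCII this way also deletes legitimate accented characters, so it is a blunt instrument.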

For more info: https://docs.python.org/3/howto/unicode.html

Another recommended approach is unicodedata.normalize.
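A sketch of that approach: NFKD normalization decomposes characters into base letters plus combining marks (and turns compatibility characters like the non-breaking space into their plain equivalents), after which encoding to ASCII with errors='ignore' discards anything without an ASCII form. The function name is just illustrative:

```python
import unicodedata

def normalize_text(text):
    # NFKD splits accented characters into base letter + combining mark;
    # encoding to ASCII with errors='ignore' then drops the marks and any
    # other character that has no ASCII equivalent.
    return (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('ascii'))
```

Unlike the regex above, this keeps the base letters of accented characters (é becomes e) rather than deleting them outright.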

Hope it helps

Nandan Pandey