I have a CSV file that I open with 'latin1' encoding. However, the emojis in it are not read correctly, and I want to remove all of them. They show up as square boxes, and when I convert the data to a list they appear as "\x80". Is there any way to remove them?

import pandas as pd
df = pd.read_csv(r"myfilepath", encoding='latin1')

dkdlfls26
  • "opened with 'latin1' encoding ... problem with reading emojis" The Latin1 encoding does not support emojis. If your file contains emojis, it's not Latin1 encoded. Do you know the appropriate encoding of your file, e.g. UTF-8? Why don't you use the correct encoding, but use Latin1 instead? – MisterMiyagi Apr 16 '20 at 13:45
  • The correct encoding is almost certainly UTF-8 – snakecharmerb Apr 16 '20 at 13:49
  • @MisterMiyagi This is the error message I get whenever I tried to open the file with UTF-8. <> – dkdlfls26 Apr 17 '20 at 01:36
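
A minimal sketch of what the comments above describe, assuming the file really is UTF-8 (which this thread never confirms). The sample string is invented for illustration. Decoding UTF-8 bytes as Latin-1 never raises, because every byte maps to some character, but a four-byte emoji turns into four unrelated characters, and the last of them can be '\x80':

```python
# Hypothetical sample text; the real file contents are unknown.
original = "nice \U0001F600"

# The bytes a UTF-8 encoded file would hold for this text.
raw = original.encode("utf-8")

# Latin-1 maps every byte to exactly one character, so this never fails,
# but the emoji's four bytes become four separate junk characters.
misread = raw.decode("latin1")
print(repr(misread))

# The misreading is reversible as long as the text was not altered in between.
repaired = misread.encode("latin1").decode("utf-8")
print(repaired == original)
```

If the file is not valid UTF-8 throughout, the last step would raise a UnicodeDecodeError, so this only illustrates where the '\x80' can come from rather than a guaranteed repair.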

2 Answers

Try an ASCII round trip. Note that this deletes every non-ASCII character, not only the emojis:

l_data = [x.encode('ascii', 'ignore').decode('ascii') for x in l_data]

If you want to remove a particular character:

l_data = [x.replace('\x80', '') for x in l_data]

Answer motivated by this
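
To make the difference between the two snippets concrete, here is a small self-contained sketch. The helper name `strip_non_ascii` and the sample strings are made up for illustration; `l_data` stands in for whatever list of strings was taken from the DataFrame.

```python
def strip_non_ascii(text: str) -> str:
    """Drop every character that is not plain ASCII."""
    return text.encode("ascii", "ignore").decode("ascii")


# Hypothetical stand-in for the list of strings taken from the DataFrame.
# The first entry mimics an emoji that was misread as Latin-1.
l_data = ["good \xf0\x9f\x98\x80", "caf\xe9", "plain"]

# Option 1: remove everything outside ASCII (also removes the accented letter).
cleaned = [strip_non_ascii(x) for x in l_data]

# Option 2: remove only the '\x80' character (other leftover characters stay).
only_x80_removed = [x.replace("\x80", "") for x in l_data]

print(cleaned)
print(only_x80_removed)
```

The first option is the more thorough cleanup but also discards legitimate accented text; the second leaves the rest of a misread emoji in place. For a DataFrame column that holds only strings, the same helper could probably be applied with something like `df[col].map(strip_non_ascii)`, where `col` is a placeholder for the column name.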

Cblopez
  • Thank you for your answer! But could you please tell me what 'l_data' and 'x' in your code indicate in my case?? – dkdlfls26 Apr 16 '20 at 13:40
  • `l_data` is the name of the list of CSV lines. You named it X_data I think, my bad, but the important thing is that it identifies the list, whatever it's called. `x` is the variable that refers to each element of the list in turn. The syntax used there is called a list comprehension; you can have a look [here](https://www.pythonforbeginners.com/basics/list-comprehensions-in-python). It translates into "Create a list with all the elements of `l_data`, but before inserting each one, replace '\x80' with the empty string" – Cblopez Apr 16 '20 at 14:25
  • Thank you so much! Your code works and thanks again for a kind explanation! – dkdlfls26 Apr 17 '20 at 01:40
  • No problem! I would appreciate it if you could accept the answer. Thanks – Cblopez Apr 17 '20 at 07:31

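Try this (note that in Python 'iso-8859-1' is an alias of 'latin1', so this decodes the file the same way as the call in the question):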

df = pd.read_csv(r"myfilepath", encoding='iso-8859-1')

See the linked question below:

UnicodeEncodeError : 'charmap' codec can't encode character '\x80' in position 0 : character maps to <undefined>