9

When I try to use:

df[df.columns.difference(['pos', 'neu', 'neg', 'new_description'])].to_csv('sentiment_data.csv')

I get the error:

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 388: surrogates not allowed

I don't understand what this error means, or how to fix it and export my data to CSV/Excel. I have referred to this question, but I don't understand much of it, and it doesn't explain how to do this with pandas.

What does position 388 mean? What is the character '\ud83d'?

I get a different error position when I try to export to an excel:

df[df.columns.difference(['pos', 'neu', 'neg', 'new_description'])].to_excel('sentiment_data_new.xlsx')

Error while exporting to excel:

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 261: surrogates not allowed

Why is the position different when it's the same encoding?

The other duplicate questions don't answer how to escape this error with pandas DataFrame.

Mohit Motwani
  • 2
    The codepoint D83D is the first element of a surrogate pair, the second being almost certainly in the emoticon range DE00–DE4F. Suppose the second of the pair is DE04. Then together they make a surrogate for codepoint 1F604 SMILING FACE WITH OPEN MOUTH AND SMILING EYES . – BoarGules Feb 05 '19 at 14:38
  • try the encoding parameter, by default its None – iamklaus Feb 05 '19 at 15:29
  • Can you somehow localize, which text in your dataframe gives this error? It will be very helpful if we could see the row in your dataframe which causes this error – Teoretic Feb 05 '19 at 18:43
  • @Teoretic I know that error is at 131086, because my dataframe is written till the previous row in the csv. When I try to print this row I get the same error. :{ – Mohit Motwani Feb 06 '19 at 02:00
  • @BoarGules Can you explain what a surrogate pair means? and why we need them? – Mohit Motwani Feb 06 '19 at 02:01
  • @Teoretic I think the char causing trouble is � – Mohit Motwani Feb 06 '19 at 06:46
  • 1
    Possible duplicate of [How to work with surrogate pairs in Python?](https://stackoverflow.com/questions/38147259/how-to-work-with-surrogate-pairs-in-python) – tripleee Feb 06 '19 at 07:29

3 Answers

23

Most emojis in Unicode lie outside the Basic Multilingual Plane (BMP), which means they have codepoints that won't fit in 16 bits. Surrogate pairs are a way to make these characters directly representable in UTF-16 as a pair of 16-bit code units.
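To make the pairing concrete, here is a minimal sketch of the UTF-16 surrogate arithmetic, using the D83D/DE04 example from the comments. The function names `to_surrogates` and `from_surrogates` are made up for this illustration and are not part of any library:

```python
def to_surrogates(codepoint):
    """Split a codepoint above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("codepoint is not outside the BMP")
    offset = codepoint - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into one codepoint."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F604)
print(hex(high), hex(low))                   # 0xd83d 0xde04
print(hex(from_surrogates(0xD83D, 0xDE04)))  # 0x1f604
```

So '\ud83d' on its own is only the first half of such a pair, which is why UTF-8 refuses to encode it.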

You can force surrogate pairs to be resolved into the corresponding codepoint outside the BMP like this:

"\ud83d\ude04".encode('utf-16','surrogatepass').decode('utf-16')

This will give you the codepoint \U0001f604. Note how it takes more than 4 hex digits to express.
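As a sketch of how this could be applied to text data before exporting, the same round trip can be wrapped in a small helper. The name `resolve_surrogates` is hypothetical, and passing 'replace' to the decode step is my own assumption so that a stray, unpaired surrogate becomes U+FFFD instead of raising:

```python
def resolve_surrogates(text):
    """Merge surrogate pairs into real codepoints; stray halves become U+FFFD."""
    if not isinstance(text, str):
        return text  # leave NaN, numbers, etc. untouched
    return text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")


fixed = resolve_surrogates("great \ud83d\ude04")
print(fixed == "great \U0001f604")
fixed.encode("utf-8")  # no longer raises UnicodeEncodeError
```

With pandas, mapping this over each affected text column (for example with Series.map) before calling to_csv or to_excel should then let the export go through, assuming the surrogates are the only problem in the data.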

But this solution may only get you so far.

A lot of software (including pygame, older versions of IDLE, PowerShell, and the Windows command prompt) supports only the BMP, because it doesn't really use UTF-16 but its predecessor UCS-2, which is essentially UTF-16 without support for codepoints outside the BMP.

When this answer was originally posted, in IDLE 3.7 and before, print('\U0001f604') would just raise a UnicodeEncodeError: 'UCS-2' codec can't encode character '\U0001f604' in position 0: Non-BMP character not supported in Tk.

Python 3.8 finally fixed this and the fixes were backported to subsequent releases of Python 3.7, so in IDLE now, you can either provide the 17-bit codepoint:

print('\U0001f604')

or transcode the UTF-16 surrogate pair to the same codepoint:

print("\ud83d\ude04".encode('utf-16', 'surrogatepass').decode('utf-16'))

and both will print 😄.

What you still cannot do is print the UTF-16 surrogate pair as is: if you try print("\ud83d\ude04"), you will get the same \u escapes back.

BoarGules
1

You can delete all emojis using a regex pattern:

import re

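def remove_emojis(string):
    # Character class covering the most common emoji blocks.
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "\U00002702-\U000027B0"  # dingbats
        # Very wide range: besides enclosed characters it also matches
        # CJK text and lone surrogates such as '\ud83d'.
        "\U000024C2-\U0001F251"
        "]+",
        flags=re.UNICODE,
    )

    return emoji_pattern.sub("", string)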

remove_emojis("jajajajajaj ")

Credits to: https://medium.com/geekculture/text-preprocessing-how-to-handle-emoji-emoticon-641bbfa6e9e7
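If the goal is only to get past the UnicodeEncodeError rather than to strip every emoji, a narrower pattern that targets just the surrogate range may be enough. This is a sketch under that assumption; the names `SURROGATES` and `strip_surrogates` are made up for the example:

```python
import re

# Matches any UTF-16 surrogate code unit left over in a Python str.
SURROGATES = re.compile("[\ud800-\udfff]")


def strip_surrogates(text, replacement=""):
    """Remove (or replace) surrogate code units; real emojis are kept."""
    return SURROGATES.sub(replacement, text)


cleaned = strip_surrogates("nice \ud83d day \U0001f604")
cleaned.encode("utf-8")  # encodes without error
```

Unlike the broader emoji pattern, this leaves properly encoded emojis and other non-ASCII text in place.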

0

I had this issue too. You can use the string replace method to replace '\ud83c' before calling to_csv, etc.

For example:

my_string_list = [i.replace('\ud83c', ' ') for i in my_string_list]

After that, to_csv and similar calls will work.
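Since '\ud83c' is only one of many possible surrogates, it may help to check first which strings are affected. Here is a minimal sketch, with `find_unencodable` as a hypothetical helper name; it relies on the start attribute of UnicodeEncodeError, which is the same "position" shown in the error message (an offset into whichever string was being encoded):

```python
def find_unencodable(strings, encoding="utf-8"):
    """Return (index, position) for each string that fails to encode."""
    problems = []
    for index, value in enumerate(strings):
        try:
            value.encode(encoding)
        except UnicodeEncodeError as exc:
            # exc.start is the offset of the first offending character
            problems.append((index, exc.start))
    return problems


my_string_list = ["fine", "bad \ud83c here", "also fine \U0001f604"]
print(find_unencodable(my_string_list))  # [(1, 4)]
```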

masanmola