EDIT

I have a text file containing sentences with emojis that I cannot decode correctly.

My CSV file contains these sentences:

  • Je suis sur que certaines personnes vont faire la file pour toucher cette borne unicode-d83d\ude02

  • Aurelie Gouverneur voir même la lechée peut être unicode-d83d\ude02unicode-d83d\ude02unicode-d83e\udd2e

  • Mélanie Ham même ce prendre en photo avec unicode-d83e\udd23

My code:

df_test=pd.read_csv("myfile.csv", sep=';',index_col=None, encoding="utf-8")

for item, row in df_test.iterrows():
    print(repr(row["Message"]))
    s=row["Message"]
    s = re.sub(r'unicode-([0-9a-f]{4})',lambda m: chr(int(m.group(1),16)),s)
    s = s.encode('utf16','surrogatepass').decode('utf16')

The printed results:

'Je suis sur que certaines personnes vont faire la file pour toucher cette borne unicode-d83d\\ude02'
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-40-e3c423a15acd> in <module>
      5     s=row["Message"]
      6     s = re.sub(r'unicode-([0-9a-f]{4})',lambda m: chr(int(m.group(1),16)),s)
----> 7     s = s.encode('utf16','surrogatepass').decode('utf16')

UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 162-163: illegal UTF-16 surrogate

The issue is probably related to the encoding used when I load the CSV into a dataframe, but I have no idea how to solve this...

  • Does this answer your question? [How to work with surrogate pairs in Python?](https://stackoverflow.com/questions/38147259/how-to-work-with-surrogate-pairs-in-python) – Jongware Feb 10 '20 at 16:49
  • Unfortunately, I tried this solution but it doesn't work for me... I started by replacing "unicode-" with "\u", but that produces "\\ud83d\ude44", which the encode / decode functions do not interpret as an emoji – Erwan Le Nagard Feb 11 '20 at 10:41
  • I'm going to have to ask *what you exactly got*. Literal bytes, please. Currently your post contains `"This is some text and emoji **unicode-d83d\ude44**"`, which is ambiguous enough. Are the stars part of it? (I guess not -- but I can guess all day long.) No backslash before `d83d`? A single backslash before `ude44` -- or, is it? Or is it part of Python's string `repr`? Only you can tell us. – Jongware Feb 11 '20 at 11:13
  • I have a string. str = "This is some text and emoji unicode-d83d\ude44". My objective is to turn "unicode-d83d\ude44" into "\ud83d\ude44", so I could print the emoji using "\ud83d\ude44".encode('utf-16', 'surrogatepass').decode('utf-16') – Erwan Le Nagard Feb 11 '20 at 12:21

1 Answer

The text mixes a custom syntax (`unicode-d83d`) with a Unicode escape (`\ude02`). The code below captures the hexadecimal values of both halves, formats them as a JSON string containing a pair of `\u` surrogate escapes, and lets the `json` module combine that pair into the correct Unicode code point.

#coding:utf8
import re
import json

sentences = [r'Je suis sur que certaines personnes vont faire la file pour toucher cette borne unicode-d83d\ude02',
             r'Aurelie Gouverneur voir même la lechée peut être unicode-d83d\ude02unicode-d83d\ude02unicode-d83e\udd2e',
             r'Mélanie Ham même ce prendre en photo avec unicode-d83e\udd23']

def surrogates_to_unicode(m):
    upper = int(m.group(1),16)
    lower = int(m.group(2),16)
    return json.loads(f'"\\u{upper:04x}\\u{lower:04x}"')

for s in sentences:
    s = re.sub(r'unicode-([0-9a-f]{4})\\u([0-9a-f]{4})',surrogates_to_unicode,s)
    print(s)

Output:

Je suis sur que certaines personnes vont faire la file pour toucher cette borne 😂
Aurelie Gouverneur voir même la lechée peut être 😂😂🤮
Mélanie Ham même ce prendre en photo avec 🤣

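
As a possible alternative to the `json` round-trip, the surrogate pair can also be combined with plain arithmetic. This is only a minimal sketch and not part of the original answer: `PATTERN` and `combine_surrogates` are hypothetical names, and it assumes the input always uses the `unicode-XXXX\uXXXX` form shown in the question.

```python
import re

# Matches the mixed form from the question: "unicode-d83d\ude02"
# (custom prefix for the high surrogate, literal backslash-u for the low one).
PATTERN = re.compile(r'unicode-([0-9a-f]{4})\\u([0-9a-f]{4})')

def combine_surrogates(m):
    """Hypothetical helper: merge a UTF-16 surrogate pair into one character."""
    high = int(m.group(1), 16)
    low = int(m.group(2), 16)
    # Leave the text untouched if the values are not a valid high/low pair.
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        return m.group(0)
    # Standard UTF-16 formula: 10 bits from each surrogate, offset by 0x10000.
    return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))

text = r'Mélanie Ham même ce prendre en photo avec unicode-d83e\udd23'
decoded = PATTERN.sub(combine_surrogates, text)
print(f'U+{ord(decoded[-1]):04X}')  # prints U+1F923
```

The range check means stray matches that are not real surrogate pairs are left as they were rather than turned into unrelated characters.
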
Mark Tolonen
  • Actually, my input file is a CSV that I load into a dataframe. My understanding is that the CSV was saved by a Java program which encodes the emojis in a Java standard (e.g. unicode-d83d\ude44). But when I load the CSV into a Python dataframe, those emojis are read as plain strings and not as Unicode characters... – Erwan Le Nagard Feb 11 '20 at 10:46
  • @Erwan that’s no standard I’ve ever seen. Why aren’t both surrogates using \u escapes? – Mark Tolonen Feb 11 '20 at 20:07
  • Thank you so much !!!! :) Honestly I don't know why they formatted the emojis like this.... – Erwan Le Nagard Feb 12 '20 at 09:38
  • I still have some difficulty loading a CSV containing those wrongly formatted emojis. The error says: UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 162-163: illegal UTF-16 surrogate `df_test=pd.read_csv("myfile.csv", sep=';',index_col=None, encoding="utf-8") for item, row in df_test.iterrows(): print(row["Message"]) s=row["Message"] print(s) s = re.sub(r'unicode-([0-9a-f]{4})',lambda m: chr(int(m.group(1),16)),s) s = s.encode('utf16','surrogatepass').decode('utf16') print(s)` – Erwan Le Nagard Feb 24 '20 at 09:37
  • @Erwan Edit your question with your example. Be precise on the content of the string that fails. Use `repr()` to display. – Mark Tolonen Feb 24 '20 at 15:39
  • I've edited my question with more details. Thanks for your help. – Erwan Le Nagard Feb 25 '20 at 10:35
  • @Erwan Updated. The problem with the original answer is it didn't take into account that the second surrogate wasn't a Unicode code point but another Unicode escape. JSON would have been a better format for the original sentences. – Mark Tolonen Feb 25 '20 at 16:02
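
To illustrate the point made in the last comment, here is a rough sketch of why converting only the `unicode-XXXX` half fails while the two-group pattern works, applied row by row to CSV data. It uses the standard-library `csv` module on an in-memory stand-in for `myfile.csv`; the `Id` column, the sample row, and the names `PAIR`, `to_char` and `first_half_only` are assumptions made for the example.

```python
import csv
import io
import json
import re

# Stand-in for "myfile.csv" (assumed layout: ';' separator, "Message" column).
# The backslash is literal text in the file, exactly as repr() showed.
raw = 'Id;Message\n1;borne unicode-d83d\\ude02\n'

PAIR = re.compile(r'unicode-([0-9a-f]{4})\\u([0-9a-f]{4})')

def to_char(m):
    # Same idea as the answer: json combines the two surrogate escapes.
    return json.loads(f'"\\u{m.group(1)}\\u{m.group(2)}"')

def first_half_only(msg):
    """The question's approach: only 'unicode-XXXX' becomes a surrogate,
    so a lone high surrogate ends up next to the literal text '\\ude02'."""
    s = re.sub(r'unicode-([0-9a-f]{4})', lambda m: chr(int(m.group(1), 16)), msg)
    return s.encode('utf16', 'surrogatepass').decode('utf16')

results = []
for row in csv.DictReader(io.StringIO(raw), delimiter=';'):
    msg = row['Message']
    try:
        first_half_only(msg)
        question_approach_failed = False
    except UnicodeDecodeError:
        # Unpaired surrogate: the decoder rejects it as illegal UTF-16.
        question_approach_failed = True
    results.append((question_approach_failed, PAIR.sub(to_char, msg)))
```

With pandas, the same compiled pattern and callable could presumably be passed to `Series.str.replace` with `regex=True` instead of looping over rows, though that variant is not tested here.
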