Related to this question: Emoji crashed when uploading to Big Query

I'm looking for a clean way to convert emojis from the \ud83d\ude04 form (a UTF-16 surrogate pair) to the \U0001f604 form (a single Unicode code point escape). So far, the only idea I have is to write a Python function that goes through the text file and replaces each emoji escape.

This question shows how a single string can be converted:

Converting emojis to Unicode and vice versa in python 3

My guess is that I need to go through the text line by line and convert it?

Potential idea:

with open(ff_name, 'rb') as source_file:
  with open(target_file_name, 'w+b') as dest_file:
    contents = source_file.read()
    dest_file.write(contents.decode('utf-16').encode('utf-8'))
  • `\ud83d\ude04` looks like UTF-16 **en**coding of the code point [`U+1F604`](https://www.charbase.com/1f604-unicode-smiling-face-with-open-mouth-and-smiling-eyes), you can not further "encode the encoding" (it doesn't make any sense whatsoever), you can only **de**code it into code points, and then again encode it using some different encoding. What exactly do you want? How is splitting the text into lines supposed to help? – Andrey Tyukin Sep 05 '18 at 11:07
  • @AndreyTyukin The idea was to go through the file line by line, find every value that starts with "\u", and change it to Unicode. To be fair, this is only an assumption; I'm stuck on this question, I believe any crazy idea is better than nothing, and I'm currently out of ideas. I have the text file (a modified JSON file) from the referenced question, and all I need is to change this emoji `\ud83d\ude04` to this one `U+1F604` and apply this change to all emojis in the file. –  Sep 05 '18 at 11:30
  • So, you have a text file that contains escape sequences of shape `\uhhhh` (with hexadecimal numbers) that represent UTF-16 code units. Some pairs of those code units are surrogate pairs representing Unicode code points in the range U+10000 to U+10FFFF. You want to extract those code points, and write them out as formatted strings in the format `\Uhhhhhhhh`. Correct? – Andrey Tyukin Sep 05 '18 at 13:10
  • @AndreyTyukin yes, change emoji format from `\ud83d\ude04` to `U0001F604` and this rule applies to all emojis inside text file –  Sep 05 '18 at 13:15
  • @AndreyTyukin do you think this is possible to do?? –  Sep 05 '18 at 13:33
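
The surrogate-pair arithmetic described in the comments above can be sketched for a single pair as follows. This is only an illustration: the helper name `combine_surrogates` is hypothetical, and the formula is the standard UTF-16 rule for combining a high and a low surrogate.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point.

    Hypothetical helper, not part of any library.
    """
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83D, 0xDE04)
formatted = "\\U{:08x}".format(code_point)
print(formatted)  # \U0001f604
```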

2 Answers

So, I'll assume that you somehow get a raw ASCII string that contains escape sequences with UTF-16 code units that form surrogate pairs, and that you (for whatever reason) want to convert it to \UXXXXXXXX-format.

So, henceforth I assume that your input (bytes!) looks like this:

weirdInput = "hello \\ud83d\\ude04".encode("latin_1")

Now you want to do the following:

  1. Interpret the bytes in a way that \uXXXX thingies are transformed into UTF-16 code units. There is the raw_unicode_escape codec, but unfortunately it needs a separate pass to fix the surrogate pairs (I don't know why, to be honest).
  2. Fix the surrogate pairs, transform the data into valid UTF-16
  3. Decode as valid UTF-16
  4. Again, encode as "raw_unicode_escape"
  5. Decode back as good old latin_1, consisting only of good old ASCII with unicode escape sequences in format \UXXXXXXXX.

Something like this:

output = (weirdInput
  .decode("raw_unicode_escape")
  .encode('utf-16', 'surrogatepass')
  .decode('utf-16')
  .encode("raw_unicode_escape")
  .decode("latin_1")
)

Now if you print(output), you get:

hello \U0001f604

Note that if you stop at an intermediate stage:

smiley = (weirdInput
  .decode("raw_unicode_escape")
  .encode('utf-16', 'surrogatepass')
  .decode('utf-16')
)

then you get a unicode-string with smileys:

print(smiley)
# hello 😄
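
For comparison, stopping one step earlier still, right after the raw_unicode_escape decode, leaves two separate surrogate code points in the string. The sketch below reuses the `weirdInput` value from this answer; the other variable names are only illustrative. It shows that such a string cannot be encoded as UTF-8 until the UTF-16 round trip has joined the pair.

```python
weirdInput = "hello \\ud83d\\ude04".encode("latin_1")

# After this step the string ends with two lone surrogates, not one emoji.
half_decoded = weirdInput.decode("raw_unicode_escape")
print(len(half_decoded))  # 8: "hello " plus two surrogate code points

# Lone surrogates are not allowed in UTF-8, so a strict encode fails.
try:
    half_decoded.encode("utf-8")
    utf8_ok = True
except UnicodeEncodeError:
    utf8_ok = False
print(utf8_ok)  # False

# The UTF-16 round trip joins the pair into a single code point.
fixed = half_decoded.encode("utf-16", "surrogatepass").decode("utf-16")
print(len(fixed))  # 7
```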

Full code:

weirdInput = "hello \\ud83d\\ude04".encode("latin_1")

output = (weirdInput
  .decode("raw_unicode_escape")
  .encode('utf-16', 'surrogatepass')
  .decode('utf-16')
  .encode("raw_unicode_escape")
  .decode("latin_1")
)


smiley = (weirdInput
  .decode("raw_unicode_escape")
  .encode('utf-16', 'surrogatepass')
  .decode('utf-16')
)

print(output)
# hello \U0001f604

print(smiley)
# hello 😄
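
Since the question is about a whole text file, the same chain could be wrapped in a small helper. This is a minimal sketch under the same latin_1 assumption as above; `convert_escapes`, `convert_file`, and the file names are hypothetical, and an input containing an unpaired surrogate would make the UTF-16 decode step raise `UnicodeDecodeError`.

```python
import tempfile
from pathlib import Path


def convert_escapes(raw: bytes) -> str:
    """Turn \\uXXXX surrogate-pair escapes into \\UXXXXXXXX escapes."""
    return (raw
        .decode("raw_unicode_escape")
        .encode("utf-16", "surrogatepass")
        .decode("utf-16")
        .encode("raw_unicode_escape")
        .decode("latin_1")
    )


def convert_file(source: Path, target: Path) -> None:
    """Read the source as bytes, convert it, and write the result as bytes."""
    converted = convert_escapes(source.read_bytes())
    target.write_bytes(converted.encode("latin_1"))


# Usage example on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "in.txt"
    dst = Path(tmp) / "out.txt"
    src.write_bytes(b'{"text": "hello \\ud83d\\ude04"}\n')
    convert_file(src, dst)
    result = dst.read_bytes().decode("latin_1")

print(result)
```

The whole file is read at once, so no line-by-line pass is needed; for very large files a chunked variant would have to make sure a surrogate pair is never split across chunks.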
Andrey Tyukin
  • @NANA Glad to help; Note however, that maybe it would have been easier to avoid any "unsanctioned modifications" on the original JSON input, and just feed it to the JSON parser, it would take care of the surrogate pairs automatically. Also, double-check whether the assumption about `latin_1` is actually correct. – Andrey Tyukin Sep 05 '18 at 14:17
  • It does work, as it has to be uploaded to Big Query, I cannot use normal JSON, unfortunately, I think I will never get to this kind of solution if I will continue the search for the way of doing it... Thank you! Hope it will help someone else as well! –  Sep 05 '18 at 14:20
  • Thank you @Andrey, I was looking in so many posts, but your solution is the only one that worked. After getting the unicode representation with your code, I also used `import unicodedata as ud` and `ud.name('\U0001f604')` to get the name of the emoji, and potentially replace the code with it. – KLaz Mar 14 '22 at 18:25
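
The two suggestions in the comments above (letting a JSON parser handle the surrogate pairs, and looking up the emoji name with `unicodedata`) might look like this when the input is still valid JSON. The sample record and variable names are made up for illustration.

```python
import json
import unicodedata

# json.loads joins the \ud83d\ude04 surrogate pair into one character.
record = json.loads('{"text": "hello \\ud83d\\ude04"}')
text = record["text"]
print(len(text))  # 7

# Re-escape the emoji in \UXXXXXXXX form.
escaped = text.encode("raw_unicode_escape").decode("latin_1")
print(escaped)  # hello \U0001f604

# Look up the official Unicode name of the last character.
name = unicodedata.name(text[-1])
print(name)  # SMILING FACE WITH OPEN MOUTH AND SMILING EYES
```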

\ud83d\ude04 is the UTF-16 representation of the character SMILING FACE WITH OPEN MOUTH AND SMILING EYES (U+1F604). You will need to decode it into a character, then convert the code point of the character into a hex string. I don't know enough Python to tell you how to do this.
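
In Python 3, the two steps described here (decode into a character, then format its code point as hex) could be sketched as below. The variable names are illustrative, and the `U+` label format is just one possible choice.

```python
# A Python string holding the two surrogate code points separately.
pair = "\ud83d\ude04"

# Round-trip through UTF-16 to join the pair into one character.
char = pair.encode("utf-16", "surrogatepass").decode("utf-16")

# Convert the character's code point into a hex string.
label = "U+{:04X}".format(ord(char))
print(label)  # U+1F604
```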

JGNI
  • If the actual emoji were a string, that would be understandable... BUT as part of a text file, I don't have a clue, not even about the logic of how to do it. –  Sep 05 '18 at 08:03