Emoji, encode/decode when text file contains utf-8 and utf-16

Question

I have a text file which contains this:

....     
{"emojiCharts":{"emoji_icon":"\u2697","repost": 3, "doc": 3, "engagement": 1184, "reach": 6734, "impression": 44898}}
{"emojiCharts":{"emoji_icon":"\U0001f924","repost": 11, "doc": 11, "engagement": 83, "reach": 1047, "impression": 6981}}
....

Some of the emojis are in \uhhhh format, some of them in \Uhhhhhhhh format.

Is there any way to encode/decode it so that the emojis are displayed? If the file contains ONLY \Uhhhhhhhh, everything works fine.

To get to this stage I modified the file this way:

insightData.decode("raw_unicode_escape").encode('utf-16', 'surrogatepass').decode('utf-16').encode("raw_unicode_escape").decode("latin_1")

To display emojis I need to use this:

insightData.decode("raw_unicode_escape").encode('utf-16', 'surrogatepass').decode('utf-16')

But it raises an error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2600' in position 30: ordinal not in range(128)

SOLUTION:

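with open(OUTPUT, "r") as infileInsight:
    # Python 2: read() returns a byte string; decode the \u / \U escapes
    insightData = infileInsight.read().decode('raw_unicode_escape')

with open(OUTPUT, "w") as outfileInsight:
    # encode explicitly so the implicit ASCII codec is never used
    outfileInsight.write(insightData.encode('utf-8'))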
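For readers on Python 3, the chain of calls from the question can be sketched as follows. This is a minimal illustration, not the asker's code: the helper name `decode_escapes` and the sample byte strings are assumptions, and it presumes the input is plain ASCII containing only `\uhhhh` / `\Uhhhhhhhh` escapes. The UTF-16 round trip with `surrogatepass` only matters when an emoji arrives as a surrogate pair such as `\ud83e\udd24`; it joins the pair into one code point.

```python
def decode_escapes(raw):
    # raw_unicode_escape interprets only \u and \U escapes; other
    # backslash sequences are left untouched.
    text = raw.decode("raw_unicode_escape")
    # A \uhhhh\uhhhh surrogate pair decodes to two lone surrogates.
    # Round-tripping through UTF-16 with 'surrogatepass' joins them.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


# Hypothetical sample inputs: the same kind of line in three escape styles.
bmp = decode_escapes(b'{"emoji_icon":"\\u2697"}')
astral = decode_escapes(b'{"emoji_icon":"\\U0001f924"}')
paired = decode_escapes(b'{"emoji_icon":"\\ud83e\\udd24"}')

# The \U form and the surrogate-pair form end up as the same string.
print(astral == paired)  # True
```

Note that the round trip would fail on an unpaired surrogate, so this sketch assumes surrogates always come in valid pairs.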

When does the UnicodeEncodeError show up? When doing `print` in a Python console? Which python version? Which operating system? — tzot, Sep 06 '18 at 12:05

score 1 · Answer 1 · answered Sep 06 '18 at 11:59

You can just do this.

print a["emojiCharts"]["emoji_icon"].decode("unicode-escape")

Output: ⚗

vks

This is not a pure JSON file, please treat it as text. I need to decode the file when both \uhhhh and \Uhhhhhhhh are inside – Sep 06 '18 at 12:00
@NANA using this method both will work. And you do json.loads first. – vks Sep 06 '18 at 12:06

score 1 · Answer 2 · answered Sep 06 '18 at 12:08

This has nothing to do with UTF-8 or UTF-16. It’s just Python’s way of escaping Unicode characters in general: code points up to U+FFFF are written as \uhhhh (four hex digits) and code points above that as \Uhhhhhhhh (eight hex digits), for historical reasons.

Both escape sequences should work exactly equally in a Python string. On my machine, using @vks’s solution:

$ python
Python 2.7.15rc1 (default, Apr 15 2018, 21:51:34)
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> '\U0000ABCD'.decode('unicode-escape')
u'\uabcd'
>>> '\uABCD'.decode('unicode-escape')
u'\uabcd'

and similar for Python 3.
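In Python 3, where `str` has no `.decode` method, the same check can be sketched on `bytes`. The sample values are illustrative only. The second half shows a difference that may matter when the file is treated as text rather than JSON: `unicode_escape` interprets every backslash escape, while `raw_unicode_escape` touches only `\u` and `\U`.

```python
# Both escape forms name the same character once decoded.
short_form = b"\\uABCD".decode("unicode_escape")
long_form = b"\\U0000ABCD".decode("unicode_escape")
print(short_form == long_form)  # True

# unicode_escape also converts sequences such as \n into a newline;
# raw_unicode_escape leaves them as a literal backslash plus 'n'.
sample = b"line\\n\\u2697"
cooked = sample.decode("unicode_escape")
raw = sample.decode("raw_unicode_escape")
print(len(cooked), len(raw))  # 6 7
```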

Initially, when I get the result, I convert the file to UTF-8, then I modify it and write it to the export file. Could that be the cause? – Sep 06 '18 at 12:15

score 0 · Answer 3 · answered Sep 06 '18 at 12:15

OK. Python 2.7, Win 10.

Your original file is plain ASCII, containing the exact unicode escapes ("\u####", 6 bytes, and "\U########", 10 bytes).

Read the file and decode using 'unicode-escape': then you have a Python unicode string; let's call it your_unicode_string.

To write a file, choose either:

output_encoding = 'utf-8'

or

output_encoding = 'utf-16-le'

and then:

import codecs
with codecs.open(output_filename, 'w', encoding=output_encoding) as fpo:
    # fpo.write(u'\ufeff') # for windows, you might want to write this at the start
    fpo.write(your_unicode_string)

For your given python and os version and without any tampering, you won't be able to just print to the console and see emojis.
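The steps in this answer can be sketched for Python 3 roughly as below. The function name `convert_escaped_file`, the temporary file names, and the sample lines are assumptions made for illustration. The sketch presumes the input is plain ASCII, as described above, and uses `raw_unicode_escape` so that only `\u` and `\U` escapes are interpreted.

```python
import os
import tempfile


def convert_escaped_file(src, dst, output_encoding="utf-8"):
    # Read the raw bytes: the file is plain ASCII with literal escapes.
    with open(src, "rb") as fin:
        text = fin.read().decode("raw_unicode_escape")
    # Write the decoded text in the chosen encoding, newlines unchanged.
    with open(dst, "w", encoding=output_encoding, newline="") as fout:
        fout.write(text)
    return text


# Demo in a throwaway directory (hypothetical file names).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "insight_escaped.txt")
dst_path = os.path.join(workdir, "insight_utf8.txt")

with open(src_path, "wb") as f:
    f.write(b'{"emoji_icon":"\\u2697"}\n{"emoji_icon":"\\U0001f924"}\n')

convert_escaped_file(src_path, dst_path)

with open(dst_path, "rb") as f:
    written = f.read()

# The escapes are gone; the file now holds the UTF-8 bytes of each emoji.
print(b"\\u" in written, b"\\U" in written)  # False False
```

Passing `output_encoding="utf-16-le"` would follow the other option mentioned above; in that case no byte order mark is written unless it is added explicitly.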

tzot

https://stackoverflow.com/questions/52199674/big-query-do-not-accept-emoji/52201558?noredirect=1#comment91352186_52201558 Initially the issue was that BigQuery did not accept emojis in \Uhhhhhhhh format, as it just printed them as text. After checks, there is no issue with BQ and the emoji can be viewed, so the problem is with the file (the way it is encoded)... I'm just lost in this forest -___- From the API, emojis came in the format `\uhhhh`, `\uhhhh\uhhhh` or `\uhhhh\uhhhh\uhhhh` and so on; I then converted them to `\Uhhhhhhhh`, which is supposed to work with Google BQ, and now, based on feedback, it is an encoding issue – Sep 06 '18 at 12:26
Basically, ideally I need the text file to contain the actual emoji instead of the \Uhhhhhhhh code. Is that possible at all? – Sep 06 '18 at 12:43
You read the input file and decode using `raw_unicode_escape`; saving this decoded unicode string in a text file using an encoding like `utf-8` or `utf-16-le`, the actual emoji will be stored. This is what my answer above does. – tzot Sep 07 '18 at 12:18
