I have a simple (but extremely hard) question.

I'm looking for a way to convert a text file that contains emoji escape codes like this (\ud83d\udc40) into one that contains the actual emoji symbols.

E.g.:

with open(OUTPUT, "r+") as infileInsight:

    insightData = infileInsight.read()\
       .replace('\ud83d\udc40','')\
       ......

    with open(OUTPUT, "w+") as outfileInsight:
            outfileInsight.write(insightData)

Regarding the suggestion that this is a duplicate: if I do it this way:

with open(OUTPUT, "r+") as infileInsight:

    insightData = infileInsight.read()\
       .replace('\ud83d\udc40','')\
       ......

    with open(OUTPUT, "w+") as outfileInsight:
            outfileInsight.write(insightData.decode('unicode-escape'))

I get an error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u2600' in position 30: ordinal not in range(128)

  • Not really extremely hard, more like a duplicate of [text with unicode escape sequences to unicode in python](https://stackoverflow.com/questions/4004431/text-with-unicode-escape-sequences-to-unicode-in-python)... – Nils Werner Sep 06 '18 at 14:10
  • @NilsWerner That solution does not help; I have already tried it. It looks simple, but it just does not work. –  Sep 06 '18 at 14:11
  • Did you try the solution here: https://stackoverflow.com/questions/33404752/removing-emojis-from-a-string-in-python ? I know they are removing the emojis, but since you can find them, you can replace them instead of just removing them. – devdob Sep 06 '18 at 14:15
  • Does the `infileInsight` file contain the literal characters `\u`? If so, when searching for something to replace, you'll need to have a string that contains those characters. The string you pass into `replace` doesn't contain a literal `\u`, it contains a unicode escape. – Daniel Pryden Sep 06 '18 at 14:15
  • @DanielPryden I have edited the question to show what the file contains. –  Sep 06 '18 at 14:18
  • Which version of Python are you using? The input data looks like it's ASCII, but what is the output encoding? (You're not specifying an output encoding, so the results depend on what version of Python you're using and on which platform. If you want to write Unicode characters you're going to need a Unicode-compatible encoding.) – Daniel Pryden Sep 06 '18 at 15:11
  • To fix your latest error use `open(OUTPUT, "w+", encoding="utf-8")`. – Mark Ransom Sep 06 '18 at 15:39
  • @MarkRansom That does not solve my issue. I even separated the `\uhhhh` and `\uhhhh\uhhhh` cases in the file, and they still do not change. –  Sep 07 '18 at 08:44
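
The distinction raised in the comments above (literal `\u` characters in the file versus a unicode escape in a Python string literal) can be sketched as follows. This is only an illustration under assumptions: it targets Python 3, the sample text and the helper name `unescape` are hypothetical, and it assumes every surrogate escape in the input is correctly paired.

```python
import re

# Text as it would appear in the file: a literal backslash, 'u', and hex digits.
raw = r"look \ud83d\udc40 sun \u2600"

# In a normal string literal, '\u2600' is a single character, so it never
# matches the six literal characters backslash-u-2-6-0-0 found in the file.
assert "\u2600" not in raw
assert r"\u2600" in raw


def unescape(text):
    """Replace literal \\uXXXX runs with characters, pairing surrogates.

    Hypothetical helper; raises UnicodeDecodeError on an unpaired surrogate.
    """
    def repl(match):
        # Collect the UTF-16 code units in this run of escapes.
        units = [int(h, 16)
                 for h in re.findall(r"\\u([0-9a-fA-F]{4})", match.group(0))]
        # Decoding as UTF-16 merges each surrogate pair into one code point.
        data = b"".join(u.to_bytes(2, "big") for u in units)
        return data.decode("utf-16-be")

    return re.sub(r"(?:\\u[0-9a-fA-F]{4})+", repl, text)


result = unescape(raw)
```

Here `result` holds the eyes emoji (U+1F440) and the sun symbol (U+2600) as real characters; writing it to a file still needs a Unicode-capable encoding such as UTF-8.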

1 Answer

You just need the `ensure_ascii=False` option in `json.dump`.

If you're creating this file in the first place, just pass that option.

If someone else gave you this JSON file and you want to change it to use Unicode characters directly in strings (as opposed to Unicode escapes as it is now), you can do something like this:

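import json

# Assumes one JSON document per line in input.txt.
with open('input.txt', 'r') as infile:
    with open('output.txt', 'w') as outfile:
        for line in infile:
            # json.loads decodes the \uXXXX escapes into real characters.
            data = json.loads(line)
            # ensure_ascii=False writes those characters unescaped.
            json.dump(data, outfile, ensure_ascii=False)
            outfile.write('\n')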
Daniel Pryden
  • Does the `json` module automatically convert surrogate code points into a single character? – Mark Ransom Sep 06 '18 at 14:28
  • Oh, I see... it's not just decoding the escapes, it's actually doing normalization. I wonder if that's a requirement for this implementation or if it's just a side-effect of the OP's approach. – Daniel Pryden Sep 06 '18 at 14:29
  • @DanielPryden I still have the same issue: UnicodeEncodeError: 'ascii' codec can't encode character u'\u2600' in position 1: ordinal not in range(128). Am I doing something wrong? –  Sep 06 '18 at 14:32
  • @DanielPryden btw, I have provided a sample data for testing. –  Sep 06 '18 at 14:33
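
For readers following the comments above, here is a minimal sketch of the two points they raise: whether `json.loads` merges an escaped surrogate pair into one character, and how an explicit output encoding avoids the `'ascii' codec` error. It assumes Python 3; the sample line and the temporary output path are hypothetical, not the asker's data.

```python
import io
import json
import os
import tempfile

# A JSON line with an escaped surrogate pair (eyes emoji) and a BMP escape (sun).
line = r'{"text": "\ud83d\udc40 \u2600"}'

data = json.loads(line)
# The two surrogate escapes come back as one code point, U+1F440.
assert data["text"] == "\U0001F440 \u2600"
assert len(data["text"]) == 3

# An explicit encoding avoids depending on the platform default (e.g. ASCII).
path = os.path.join(tempfile.mkdtemp(), "output.txt")  # hypothetical location
with io.open(path, "w", encoding="utf-8") as outfile:
    outfile.write(json.dumps(data, ensure_ascii=False))
    outfile.write("\n")

# Read the file back to confirm the characters were stored unescaped.
with io.open(path, "r", encoding="utf-8") as check:
    written = check.read()
```

The error quoted in the comments shows a `u'\u2600'` repr, which suggests Python 2; there the built-in `open` has no `encoding` parameter, so `io.open` would be the nearest equivalent, though `json` behaves differently on that version and this sketch was not written for it.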