Removing Unicode \uxxxx in String from JSON Using Regex

Question

I have a JSON file called stream_key.json that stores text data:

{"text":"RT @WBali: Ideas for easter? Digging in with Seminyak\u2019s best beachfront view? \nRSVP: b&amp;f.wbali@whotels.com https:\/\/t.co\/fRoAanOkyC"}

As we can see, the text in the JSON file contains the Unicode escape \u2019. I want to remove it using a regex in Python 2.7. This is my code so far (eraseunicode.py):

import re
import json

def removeunicode(text):
    text = re.sub(r'\\[u]\S\S\S\S[s]', "", text)
    text = re.sub(r'\\[u]\S\S\S\S', "", text)
    return text

with open('stream_key.json', 'r') as f:
    for line in f:
        tweet = json.loads(line)
        text = tweet['text']
        text = removeunicode(text)
        print(text)

The result I get is:

Traceback (most recent call last):
  File "eraseunicode.py", line 17, in <module>
    print(text)
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 53: character maps to <undefined>

Since I already call a function to remove the \u2019 before printing, I don't understand why it still raises an error. Please help. Thanks.

Possible duplicate of [Removing unicode \u2026 like characters in a string in python2.7](http://stackoverflow.com/questions/15321138/removing-unicode-u2026-like-characters-in-a-string-in-python2-7) — tripleee, Apr 11 '17 at 08:42
@tripleee I already tried that. It's actually different: in my case the text comes from a JSON file. Thanks, by the way. — ytomo, Apr 11 '17 at 08:49

score 1 · Accepted Answer · answered Apr 11 '17 at 08:34

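When the data is in the text file, \u2019 is a literal six-character sequence: a backslash, the letter u, and four hex digits. Once json.loads decodes the line, that sequence becomes the single Unicode character U+2019, so a regex that looks for a backslash no longer matches anything and the replacement has no effect. The character is still in the text when you print it, and the cp437 console codec cannot encode it.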
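A minimal sketch of that difference, using a shortened, made-up sample line rather than the real file. The pattern here is a tighter variant of the one in the question (four hex digits instead of four non-space characters), which is an assumption of this sketch:

```python
import json
import re

# Shortened sample line (an assumption for illustration, not the real file).
raw_line = r'{"text": "Seminyak\u2019s best beachfront view"}'

# Literal backslash, 'u', then four hex digits.
pattern = re.compile(r'\\u[0-9a-fA-F]{4}')

# On the raw line the escape is six ordinary ASCII characters, so it matches.
raw_matches = pattern.findall(raw_line)

# After decoding, the escape has become one character with code point 0x2019,
# so there is no backslash left for the pattern to find.
decoded = json.loads(raw_line)['text']
decoded_matches = pattern.findall(decoded)
has_curly_quote = u'\u2019' in decoded

print(len(raw_matches))      # matches found on the raw line
print(len(decoded_matches))  # matches found on the decoded text
```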

So apply your regex to the raw line before loading it with json, and it works:

tweet = json.loads(removeunicode(line))
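Put together, the corrected flow might look like the sketch below. The names strip_unicode_escapes and tweet_texts are hypothetical, the sample line is a shortened stand-in for stream_key.json, and the pattern only strips the escape itself (it does not also drop a following "s" as the first substitution in the question does). One caveat: a raw line containing an escaped backslash directly followed by u and four hex digits would be damaged by this approach, so treat it as a sketch for data like the example above.

```python
import json
import re

# Hypothetical helper; a tighter variant of the pattern in the question.
UNICODE_ESCAPE = re.compile(r'\\u[0-9a-fA-F]{4}')

def strip_unicode_escapes(raw_line):
    """Remove backslash-u escape sequences from a raw, not yet decoded JSON line."""
    return UNICODE_ESCAPE.sub('', raw_line)

def tweet_texts(lines):
    """Yield the cleaned 'text' field of each non-empty JSON line."""
    for line in lines:
        if line.strip():
            # Strip the escapes first, then let json decode what is left.
            yield json.loads(strip_unicode_escapes(line))['text']

# Shortened sample standing in for the lines of stream_key.json.
sample = [r'{"text": "Seminyak\u2019s best beachfront view? \nRSVP https:\/\/t.co\/fRoAanOkyC"}']
cleaned = list(tweet_texts(sample))
print(cleaned[0])
```

To run it against the real file, pass the open file object instead of the sample list, for example tweet_texts(f) inside the existing with open('stream_key.json', 'r') as f: block.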

Of course, this processes the entire raw line, not just the text field. You can also remove non-ASCII characters from the decoded text by checking each character's code point, like this (note that it is not strictly equivalent):

text = "".join([x for x in tweet['text'] if ord(x) < 128])
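One way the two approaches can differ, sketched with a made-up line that holds one escaped character (\u2019) and one non-ASCII character stored literally (é): the regex on the raw line only removes the escaped one, while the code point filter removes both. The encode/decode variant is shown as an assumed alternative spelling of the same filter.

```python
import json
import re

# Made-up sample: the e-acute is stored literally in the JSON text,
# while the curly quote is stored as a backslash-u escape.
raw_line = u'{"text": "caf\u00e9 at Seminyak\\u2019s beach"}'

# Approach 1: strip escapes from the raw line, then decode.
regex_only = json.loads(re.sub(r'\\u[0-9a-fA-F]{4}', '', raw_line))['text']

# Approach 2: decode first, then keep only characters below code point 128.
decoded = json.loads(raw_line)['text']
ascii_only = "".join([x for x in decoded if ord(x) < 128])

# Same filter expressed through the ascii codec, dropping what cannot be encoded.
via_encode = decoded.encode('ascii', 'ignore').decode('ascii')

print(ascii_only)
print(repr(regex_only))  # repr avoids console encoding problems
```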

Big thanks for explaining the difference between the loaded JSON and the raw text file; it really helped me. — ytomo, Apr 11 '17 at 08:48
