
I have a JSON file called stream_key.json that stores text data:

{"text":"RT @WBali: Ideas for easter? Digging in with Seminyak\u2019s best beachfront view? \nRSVP: b&f.wbali@whotels.com https:\/\/t.co\/fRoAanOkyC"}

As you can see, the text in the JSON file contains the Unicode escape \u2019. I want to remove it using a regex in Python 2.7. This is my code so far (eraseunicode.py):

import re
import json

def removeunicode(text):
    text = re.sub(r'\\[u]\S\S\S\S[s]', "", text)
    text = re.sub(r'\\[u]\S\S\S\S', "", text)
    return text

with open('stream_key.json', 'r') as f:
    for line in f:
        tweet = json.loads(line)
        text = tweet['text']
        text = removeunicode(text)
        print(text)

The result I get is:

Traceback (most recent call last):
  File "eraseunicode.py", line 17, in <module>
    print(text)
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 53: character maps to <undefined>

Since I already call a function to remove the \u2019 before printing, I don't understand why I still get this error. Please help. Thanks.

ytomo
  • Possible duplicate of [Removing unicode \u2026 like characters in a string in python2.7](http://stackoverflow.com/questions/15321138/removing-unicode-u2026-like-characters-in-a-string-in-python2-7) – tripleee Apr 11 '17 at 08:42
  • @tripleee I already tried that. It's actually different: in my case the text comes from a JSON file. Thanks, by the way. – ytomo Apr 11 '17 at 08:49

1 Answer


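In the raw text file, \u2019 is a six-character escape sequence: a backslash, the letter u, and four hex digits. Once json.loads parses the line, that escape is decoded into the single Unicode character U+2019, so your regex, which looks for a literal backslash, no longer finds anything to replace. The print then fails because the cp437 console encoding has no mapping for that character.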

So you have to apply your regex to the raw line before passing it to json.loads, and then it works:

tweet = json.loads(removeunicode(line))
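To make the difference between the raw line and the decoded value concrete, here is a minimal sketch. The sample line, the helper name remove_escapes, and the stricter hex-digit pattern are assumptions for illustration and are not taken from the question's code. Note that a pattern like this would also mangle an escaped backslash that happens to be followed by "u" and four hex digits, so treat it as a rough filter rather than a JSON-aware one.

```python
import json
import re

# A raw line as it would sit in the file: the escape is six literal
# characters (backslash, 'u', four hex digits), not one character.
raw_line = r'{"text": "Seminyak\u2019s best beachfront view"}'

def remove_escapes(raw):
    # Hypothetical helper: drop every literal \uXXXX escape from raw JSON text.
    return re.sub(r'\\u[0-9a-fA-F]{4}', '', raw)

# Decoding first turns the escape into the single character U+2019,
# which a regex looking for a backslash can no longer match.
decoded = json.loads(raw_line)['text']

# Stripping the escape from the raw line first leaves plain ASCII.
cleaned = json.loads(remove_escapes(raw_line))['text']

print(cleaned)
```

Because cleaned contains only ASCII characters here, printing it does not depend on what the console encoding can represent.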

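Of course, this processes the entire raw line, not just the text field. You can also remove non-ASCII characters from the decoded text by checking each character's code point, like this (note that it is not strictly equivalent: it also drops non-ASCII characters that were stored in the file unescaped):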

text = "".join([x for x in tweet['text'] if ord(x) < 128])
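The code-point filter can be wrapped in a small function, as in the sketch below. The function name strip_non_ascii and the sample line are assumptions for illustration. The encode call with the 'ignore' error handler is a swapped-in alternative that, for this input, gives the same result by letting the ASCII codec discard what it cannot encode.

```python
import json

def strip_non_ascii(text):
    # Hypothetical helper: keep only characters in the ASCII range (0-127).
    return u"".join(ch for ch in text if ord(ch) < 128)

# Sample line (an assumption): two escaped non-ASCII characters.
raw_line = r'{"text": "Seminyak\u2019s caf\u00e9"}'
text = json.loads(raw_line)['text']

ascii_text = strip_non_ascii(text)

# Alternative: let the codec drop anything it cannot encode.
also_ascii = text.encode('ascii', 'ignore').decode('ascii')

print(ascii_text)
```

Either form silently discards characters, so if the output still needs to be readable it may be preferable to replace U+2019 with a plain apostrophe before filtering.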
Jean-François Fabre