I have a JSON file that store text data called stream_key.json
:
{"text":"RT @WBali: Ideas for easter? Digging in with Seminyak\u2019s best beachfront view? \nRSVP: b&f.wbali@whotels.com https:\/\/t.co\/fRoAanOkyC"}
As we can see that the text in the json file contain unicode \u2019
, I want to remove this code using regex in Python 2.7, this is my code so far (eraseunicode.py):
import re
import json
def removeunicode(text):
text = re.sub(r'\\[u]\S\S\S\S[s]', "", text)
text = re.sub(r'\\[u]\S\S\S\S', "", text)
return text
with open('stream_key.json', 'r') as f:
for line in f:
tweet = json.loads(line)
text = tweet['text']
text = removeunicode(text)
print(text)
The result i get is:
Traceback (most recent call last):
File "eraseunicode.py", line 17, in <module>
print(text)
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 53: character maps to <undefined>
As I already use function to remove the \u2019
before print, I don't understand why it is still error. Please Help. Thanks