Convert Python escaped Unicode sequences to UTF-8

Question

I'm using Beautiful Soup to get the text of some HTML nodes, but those nodes contain Unicode characters, which get converted to escape sequences in the string.

For example, an HTML element that contains `50 €`, retrieved with `soup.find("h2").text`, comes back as the string `50\u20ac`, which is only readable in the Python console. It becomes unreadable when written to a JSON file. Note: I save to a JSON file using this code:

with open('file.json', 'w') as fp:
    json.dump(fileToSave, fp)

How can I convert those Unicode characters back to UTF-8 or whatever makes them readable again?

Have you tried `f = open('somefile', 'wb')` and then `f.write('your text')`? – Masoud Masoumi Moghadam Sep 06 '17 at 16:36
What do you mean by **saved to JSON**? Are you returning the JSON to some other function, or are you writing it to a file? – chad Sep 06 '17 at 16:39
Provide a [mcve]. *How* do you save it to JSON? Show the `repr()` of the content of the string. – Mark Tolonen Sep 06 '17 at 17:17

score 4 · Accepted Answer · answered Sep 06 '17 at 17:31

Small demo using Python 3. If you don't dump to JSON using ensure_ascii=False, non-ASCII characters will be written to the JSON with Unicode escape codes. That doesn't affect the ability to load the JSON, but it is less readable in the .json file itself.

Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>> html = '<element>50\u20ac</element>'
>>> html
'<element>50€</element>'
>>> soup = BeautifulSoup(html,'html')
>>> soup.find('element').text
'50€'
>>> import json
>>> with open('out.json','w',encoding='utf8') as f:
...  json.dump(soup.find('element').text,f,ensure_ascii=False)
...
>>> ^Z

Content of out.json (UTF-8-encoded):

"50€"
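
The point about escapes not affecting loading can be checked without touching a file. The following is a minimal sketch using `json.dumps` and `json.loads` from the standard library; the variable names are illustrative only and not part of the answer above.

```python
import json

text = "50\u20ac"  # the same Python string as "50€"

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(text)
# With ensure_ascii=False the character itself is written.
readable = json.dumps(text, ensure_ascii=False)

print(escaped)   # "50\u20ac"
print(readable)  # "50€"

# Both forms decode to the same Python string.
print(json.loads(escaped) == json.loads(readable) == text)  # True
```

So the escaped file is not corrupt; `ensure_ascii=False` only changes how the same data looks on disk.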

answered Sep 06 '17 at 17:31

Mark Tolonen

Thanks a lot! That worked, it's readable now. But how do I load it back correctly? Right now I use this code to load the file: `json1_file = open(filename + '.json')`, `json1_str = json1_file.read()`, `file = json.loads(json1_str)`, but the characters aren't displayed correctly. I couldn't embed the code correctly in the comment, sorry about that. – Mohamed Oun Sep 06 '17 at 18:19
The JSON renders correctly now, but when loaded back to Python this is how it looks: `50â‚¬`. – Mohamed Oun Sep 06 '17 at 18:32
@MohamedOun Open the file with `encoding='utf8'` when reading as well. It's not necessarily the default; without it, `open()` uses a locale-dependent encoding. – Mark Tolonen Sep 06 '17 at 20:42
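
A rough sketch of the full round trip, assuming Python 3. The temporary path and the `data` dictionary are made up for illustration. The last lines reproduce the garbled text from the comment above by decoding the UTF-8 bytes as cp1252, which is a common locale default on Windows; that the asker's machine used cp1252 is an assumption.

```python
import json
import os
import tempfile

data = {"price": "50\u20ac"}  # hypothetical dictionary with a non-ASCII value

# Hypothetical output location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "out.json")

# Write readable JSON, stating the encoding explicitly.
with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)

# Read it back with the same encoding.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == data)  # True

# Decoding the same UTF-8 bytes as cp1252 yields the mojibake seen above.
garbled = "50\u20ac".encode("utf-8").decode("cp1252")
print(garbled)  # 50â‚¬
```

Using `json.load(f)` on the open file replaces the separate `read()` and `json.loads()` calls; the important part is that the writer and the reader agree on the encoding.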

score 2 · Answer 2 · answered Sep 06 '17 at 16:44

For Python 2.7, I think you can use codecs and json.dump(obj, fp, ensure_ascii=False). Example:

import codecs
import json

with codecs.open(filename, 'w', encoding='utf-8') as fp:
    # obj is a 'unicode' which contains "50 €"
    json.dump(obj, fp, ensure_ascii=False)

answered Sep 06 '17 at 16:44

pciang

@MohamedOun It works fine in Python3, but you've shown no example of what you are doing wrong so we can correct it. – Mark Tolonen Sep 06 '17 at 17:18
@MarkTolonen I have a dictionary where the values are strings that have unicode characters. I save that dict as a JSON file, but the unicode characters are displayed like `\u20ac` in it. Do you need more details? – Mohamed Oun Sep 06 '17 at 17:50

score 0 · Answer 3 · edited Aug 05 '22 at 17:38

Try the following:

utf8string = <unicodestring>.encode("utf-8")

edited Aug 05 '22 at 17:38

Peter Mortensen

answered Sep 06 '17 at 16:36

Dharmesh Fumakiya

The problem is, it returns a string, not a unicode string. Anyway, I tried encoding that string, but I can't save it to JSON because `Object of type 'bytes' is not JSON serializable`. – Mohamed Oun Sep 06 '17 at 16:42
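
To illustrate the error in this comment, here is a small sketch, assuming Python 3, where `str` is already Unicode and `.encode()` returns `bytes`. The variable names are illustrative only.

```python
import json

text = "50\u20ac"               # a Python 3 str, i.e. already Unicode
encoded = text.encode("utf-8")  # bytes: b'50\xe2\x82\xac'

# json cannot serialize bytes, so encoding by hand before dumping fails.
try:
    json.dumps(encoded)
    error = None
except TypeError as exc:
    error = str(exc)

print(error)

# Passing the str itself works; the encoding is applied when the text
# is written to a file opened with encoding='utf-8'.
result = json.dumps(text, ensure_ascii=False)
print(result)  # "50€"
```

In other words, the string should stay a `str` all the way into `json.dump`, and the file object should be left to do the UTF-8 encoding.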
