3

I'm using Beautiful Soup. It gets me the text of some HTML nodes, but those nodes contain Unicode characters, which get converted to escape sequences in the string.

For example, an HTML element that contains 50 € is retrieved by Beautiful Soup with soup.find("h2").text as the string 50\u20ac, which is only readable in the Python console. It becomes unreadable when written to a JSON file. Note: I save to a JSON file using this code:

with open('file.json', 'w') as fp:
    json.dump(fileToSave, fp)

How can I convert those Unicode characters back to UTF-8 or whatever makes them readable again?

Peter Mortensen
Mohamed Oun

3 Answers

4

Small demo using Python 3. If you don't dump to JSON using ensure_ascii=False, non-ASCII characters will be written to the JSON file as Unicode escape codes. That doesn't affect the ability to load the JSON, but it makes the .json file itself less readable.

Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>> html = '<element>50\u20ac</element>'
>>> html
'<element>50€</element>'
>>> soup = BeautifulSoup(html,'html')
>>> soup.find('element').text
'50€'
>>> import json
>>> with open('out.json','w',encoding='utf8') as f:
...  json.dump(soup.find('element').text,f,ensure_ascii=False)
...
>>> ^Z

Content of out.json (UTF-8-encoded):

"50€"
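As a rough illustration of the effect of ensure_ascii described above, the following Python 3 sketch compares both settings using json.dumps instead of a file. The variable names are illustrative only and are not part of the original demo.

```python
import json

text = "50\u20ac"  # same string as in the demo above: 50 followed by the euro sign

# Default behaviour (ensure_ascii=True): non-ASCII characters are escaped.
escaped = json.dumps(text)

# With ensure_ascii=False the euro sign is kept as a literal character.
readable = json.dumps(text, ensure_ascii=False)

# Both representations decode back to the identical Python string.
assert json.loads(escaped) == json.loads(readable) == text
```

Either form is valid JSON; the difference only matters to a person reading the file, or to a tool that does not decode the escape sequences.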
Mark Tolonen
  • Thanks a lot! That worked, it's readable now. But how do I load it back correctly? Right now I use this code to load the file: `json1_file = open(filename + '.json')`, `json1_str = json1_file.read()`, `file = json.loads(json1_str)`, but the characters aren't displayed correctly. I couldn't embed the code correctly in the comment, sorry about that. – Mohamed Oun Sep 06 '17 at 18:19
  • The JSON renders correctly now, but when loaded back to Python this is how it looks: `50€`. – Mohamed Oun Sep 06 '17 at 18:32
  • 1
    @MohamedOun Open the file with `encoding='utf8'`. It's not the default. – Mark Tolonen Sep 06 '17 at 20:42
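A minimal Python 3 sketch of the round trip discussed in these comments. It assumes the file was written as UTF-8, and it assumes the garbled text came from decoding with a Windows code page such as cp1252 (the comments do not say which encoding was actually in effect). The sample data and the temporary path are hypothetical.

```python
import json
import os
import tempfile

data = {"price": "50\u20ac"}  # hypothetical sample data
path = os.path.join(tempfile.mkdtemp(), "out.json")  # hypothetical location

# Write UTF-8 with the euro sign kept as a literal character.
with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)

# Reading with the same encoding restores the original text.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

# Decoding the same bytes as cp1252 (assumed) yields the garbled form.
with open(path, encoding="cp1252") as f:
    garbled = json.load(f)

assert loaded == data
assert garbled != data
```

Passing the encoding explicitly on both the write and the read avoids relying on the platform default, which can differ between systems.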
2

For Python 2.7, I think you can use codecs and json.dump(obj, fp, ensure_ascii=False). Example:

import codecs
import json

with codecs.open(filename, 'w', encoding='utf-8') as fp:
    # obj is a 'unicode' which contains "50 €"
    json.dump(obj, fp, ensure_ascii=False)
pciang
  • @MohamedOun It works fine in Python 3, but you've shown no example of what you are doing wrong, so we can't correct it. – Mark Tolonen Sep 06 '17 at 17:18
  • @MarkTolonen I have a dictionary where the values are strings that have unicode characters. I save that dict as a JSON file, but the unicode characters are displayed like `\u20ac` in it. Do you need more details? – Mohamed Oun Sep 06 '17 at 17:50
0

Please try the following:

utf8string = <unicodestring>.encode("utf-8")
Peter Mortensen
Dharmesh Fumakiya
  • 1
    The problem is, it returns a string, not a unicode string. Anyway, I tried encoding that string, but I can't save it to JSON because `Object of type 'bytes' is not JSON serializable`. – Mohamed Oun Sep 06 '17 at 16:42
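The error mentioned in this comment can be reproduced with a short Python 3 sketch. It assumes the value is an ordinary str, as the comment states; the variable names are illustrative only.

```python
import json

text = "50\u20ac"               # a Python 3 str is already Unicode text
encoded = text.encode("utf-8")  # bytes object: b'50\xe2\x82\xac'

# json cannot serialize bytes, so encoding first leads to a TypeError.
try:
    json.dumps(encoded)
    error = None
except TypeError as exc:
    error = str(exc)

# Passing the str itself works; ensure_ascii=False keeps it readable.
result = json.dumps(text, ensure_ascii=False)
```

In Python 3 the encoding step belongs to the file object (the encoding argument of open), not to the value handed to json.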