1

I'm using BeautifulSoup to parse a bunch of web pages which I downloaded locally using WGet.

I'm reading in the file like this:

file = open(file_name, 'r', encoding='utf-8').read()
soup = BeautifulSoup(file, 'html5lib')

I'm using this soup object to get text, which I am then writing to a .json file like this:

f.write('"text": "' + str(text.encode('utf-8')) )

However, when I open the .json file I see strings like this:

and\xe2\x80\x94in spite of

He hadn\xe2\x80\x99t shaved in a few days at least

and Michael can go.\xe2\x80\x9d\xc2\xa0 Her voice

I get that these weird characters are not UTF-8 so python doesn't know what to do with them. But I don't know how to fix this.

Thanks for any help.

EDIT: I'm using python3

Also, if I remove the part where I encode the text before I write it, I get the following error: UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 264: ordinal not in range(128)

James Dorfman
  • Are you opening the file as UTF-8 encoded? – Oluwafemi Sule Aug 12 '17 at 14:48
  • It looks like you're using Python 3. You should always mention the Python version with Unicode questions, since Python 2 & 3 have big differences in that area. But anyway, those hex sequences like `\xe2\x80\x94` are actually valid UTF-8 multibyte sequences. When properly decoded, they become `and—in spite of` `He hadn’t shaved in a few days at least` `and Michael can go.”  Her voice`. I used this code to perform that transformation: `s.encode('latin1').decode()`. But I don't know BeautifulSoup, so I can't tell you the proper way to fix this. – PM 2Ring Aug 12 '17 at 14:57
  • Suggested reading: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ – Mark Tolonen Aug 12 '17 at 16:14
  • Also: https://nedbatchelder.com/text/unipain.html – Mark Tolonen Aug 12 '17 at 16:14
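
The repair PM 2Ring describes in the comments can be reproduced in a few lines. This is only a minimal sketch, assuming the escape sequences ended up as individual Latin-1 characters inside a str; the variable names are illustrative and not from the question.

```python
# A str in which each UTF-8 byte was mistakenly stored as one character.
garbled = 'He hadn\xe2\x80\x99t shaved'

# latin1 maps code points 0-255 back to the identical byte values;
# decoding those bytes as UTF-8 then recovers the intended characters.
repaired = garbled.encode('latin1').decode('utf-8')

# U+2019 is the right single quotation mark the page originally contained.
print(repaired == 'He hadn\u2019t shaved')  # True
```

This only undoes the damage after the fact; it does not replace reading and writing the files with the correct encoding in the first place.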

2 Answers

3

With str(text.encode('utf-8')) you get:

>>> text = 'He hadn’t shaved in a few days'
>>> text.encode('utf8')
b'He hadn\xe2\x80\x99t shaved in a few days'
>>> str(text.encode('utf8'))
"b'He hadn\\xe2\\x80\\x99t shaved in a few days'"
>>> print(str(text.encode('utf8')))
b'He hadn\xe2\x80\x99t shaved in a few days'

So you are getting exactly what you unintentionally wrote to the file.

Instead of manually building the JSON, use the json module. Given UTF-8-encoded input of:

<html>
<p>He hadn’t shaved in a few days</p>
</html>

Then:

from bs4 import BeautifulSoup
import json

# Good practice:
# Decode text data to Unicode when read into a program.
# Process text as Unicode in the program.
# Encode text when leaving the program, such as:
#    Writing to database.
#    Sending over a network socket.
#    Writing to a file.

# Read the content as Unicode text.
with open('test.html','r',encoding='utf8') as file:
    content = file.read()
# Name the parser explicitly; omitting it makes bs4 guess and emit a warning.
soup = BeautifulSoup(content, 'html.parser')
text = soup.find('p').text    # Unicode string!

# Build the dictionary to be written in JSON format.
# Leave as Unicode!
items = {'text':text}

# Output as UTF-8-encoded data.
#
# ensure_ascii=False makes the non-ASCII characters in the file readable,
# but it works without it.  The file will just have Unicode escapes.
#
with open('out.json','w',encoding='utf8') as out:
    json.dump(items,out,ensure_ascii=False)


# Read and decode the data back from the file and turn it back into 
# a dictionary.
with open('out.json','r',encoding='utf8') as file:
    data = json.load(file)

print(data)

Output (Python dict):

{'text': 'He hadn’t shaved in a few days'}

Content of file when ensure_ascii=False:

{"text": "He hadn’t shaved in a few days"}

Content of file when ensure_ascii=True (the default):

{"text": "He hadn\u2019t shaved in a few days"}
Mark Tolonen
  • I tried that, but this gives me the following error: json.dump(items,f,ensure_ascii=False) File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/__init__.py", line 179, in dump fp.write(chunk) UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 258: ordinal not in range(128) – James Dorfman Aug 13 '17 at 13:48
  • Never mind. I was able to fix that by using codecs.open(file_name,'w', encoding="utf-8") to open the file that I was writing to – James Dorfman Aug 13 '17 at 13:55
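
The UnicodeEncodeError reported in the comments is consistent with the output file having been opened with an ASCII default encoding. In Python 3 the built-in open accepts encoding directly, so codecs.open is not strictly needed. A minimal sketch, assuming a throwaway file in a temporary directory (the path is hypothetical and used only for illustration):

```python
import json
import os
import tempfile

items = {'text': 'He hadn\u2019t shaved in a few days'}

# Hypothetical output location, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'out.json')

# An explicit encoding avoids depending on the platform default,
# which may be ASCII in some environments.
with open(path, 'w', encoding='utf-8') as out:
    json.dump(items, out, ensure_ascii=False)

# Read the file back with the same encoding and rebuild the dictionary.
with open(path, 'r', encoding='utf-8') as src:
    restored = json.load(src)

print(restored == items)  # True
```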
0

Simplify your write: f.write('"text": "' + text) (or f.write('"text": "' + soup.prettify())). The text is already a str; wrapping text.encode('utf-8') in str() writes the repr of the bytes object instead of the characters themselves.

Use version 4.6.0: https://pypi.python.org/pypi/beautifulsoup4/

Use Python 3: its str diagnostics are more helpful than Python 2's and offer better guidance about when to encode or decode.

J_H
  • I assume the OP is already using Python 3, since `open(file_name, 'r', encoding='utf-8')` doesn't work in Python 2; at least, the standard `open` built-in function doesn't support an `encoding` keyword arg in Python 2 (although there are other `open`s that do). – PM 2Ring Aug 12 '17 at 14:58
  • If I prettify the soup, it turns into a string. I didn't show this in the question, but the text is fetched from HTML tags, which is why I need the actual soup object. Additionally, I tried removing the encoding when I wrote the text, but it created an error, which I just edited into the original question. – James Dorfman Aug 12 '17 at 15:19
  • You didn't show us how you `open`'d `f`. It sounds like your `open` chose a (default) ascii codec instead of utf8 codec. – J_H Aug 12 '17 at 16:17
  • Mark Tolonen's code is very nice. Perhaps the best part is the comment block. Do be sure to follow the "Good practice" advice. You can view `type(text)` if you're ever unsure what sort of object you have at the moment. Also call encode or decode and view the type of that result. – J_H Aug 12 '17 at 16:19
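
The type check suggested in the last comment might look like the following. This is just a sketch, assuming a short sample string in place of the text extracted from the soup:

```python
text = 'He hadn\u2019t shaved'

encoded = text.encode('utf-8')     # str -> bytes
decoded = encoded.decode('utf-8')  # bytes -> str

print(type(text).__name__)     # str
print(type(encoded).__name__)  # bytes
print(type(decoded).__name__)  # str
print(decoded == text)         # True
```

If type() reports bytes at the point where the text is written out, the value still needs decoding (or should never have been encoded) before it is concatenated with a str.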