Python: parsing UNICODE characters using bs4

Question

I am building a Python 3 web crawler/scraper using bs4. The program crashes whenever it encounters a non-ASCII Unicode character, such as a Chinese character. How do I modify my scraper so that it supports Unicode?

Here's the code:

import urllib.request
from bs4 import BeautifulSoup

def crawlForData(url):
    r = urllib.request.urlopen(url)
    soup = BeautifulSoup(r.read(), 'html.parser')
    result = [i.text.replace('\n', ' ').strip() for i in soup.find_all('p')]
    for p in result:
        print(p)

url = 'https://en.wikipedia.org/wiki/Adivasi'
crawlForData(url)

Your example seems to work fine for me. Are you sure this isn't a problem with the IDE you're running it on? Maybe your shell doesn't support Unicode? — AbdealiLoKo, Jan 05 '16 at 10:56
Do you use Python 3.x? The usage of the `print` function seems to indicate this, but I'd like to be sure. — Matthias, Jan 05 '16 at 11:00
The posted example works for me too! Maybe add an arbitrary print before printing `p`; that way you know whether the problem is with the shell or IDE rather than with parsing. It works in OSX Terminal and PyCharm. — Alan Francis, Jan 05 '16 at 12:05
In fact I have got a weird issue now. The code doesn't even run if I try it from command line. Works in PyCharm IDE unless there are unicode characters. — , Jan 05 '16 at 14:53
"The program crashes". Show the full error traceback. My guess is it is a `UnicodeEncodeError` on `print(p)` and you are running on Windows and the terminal encoding is not UTF-8. The people using Linux are using UTF-8 and it works. Try `print(repr(p))` or use an IDE that supports UTF-8. — Mark Tolonen, Jan 06 '16 at 05:59
related: [A good way to get the charset/encoding of an HTTP response in Python](http://stackoverflow.com/q/14592762/4279) — jfs, Jan 06 '16 at 06:05
@MarkTolonen Yes, I am running on Windows and it is a `UnicodeEncodeError`. — , Jan 07 '16 at 01:33
Which line has the crash? Edit your question and show the full traceback of the error message. You are using Unicode strings by default in Python 3, so it is probably the console or IDE you are using on Windows that isn't supporting the characters you are trying to print. — Mark Tolonen, Jan 07 '16 at 02:32
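
The diagnosis in the comments above (a `UnicodeEncodeError` raised by `print(p)` because the Windows console encoding is not UTF-8) can be worked around as sketched below. This is a minimal sketch rather than a definitive fix: `safe_print` is a hypothetical helper name that is not part of bs4 or the standard library, and it assumes that escaping unprintable characters is acceptable output.

```python
import sys


def safe_print(text, stream=None):
    """Print text, escaping characters the stream's encoding cannot represent.

    safe_print is a hypothetical helper for illustration, not part of bs4.
    """
    if stream is None:
        stream = sys.stdout
    # Fall back to UTF-8 if the stream does not report an encoding.
    encoding = getattr(stream, "encoding", None) or "utf-8"
    # Round-trip through the stream's encoding: characters it cannot
    # represent become \uXXXX escapes instead of raising UnicodeEncodeError.
    printable = text.encode(encoding, errors="backslashreplace").decode(encoding)
    print(printable, file=stream)


# Usage: replaces print(p) in the scraper's loop.
safe_print("Adivasi \u4e2d\u6587")
```

On Python 3.7 or later, calling `sys.stdout.reconfigure(encoding="utf-8")` once at startup, or setting the `PYTHONIOENCODING` environment variable to `utf-8`, are alternatives that keep the original characters, although the console itself may still be unable to display them.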

score 1 · Answer 1 · answered Jan 05 '16 at 10:54

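In Python 2 you could try the unicode() function, which decodes a byte string into a Unicode string. Python 3 has no unicode() function, because str is already Unicode.

Alternatively, decode the bytes yourself:

content.decode('utf-8', 'ignore')

where content is the raw bytes of the response body. The 'ignore' argument silently drops any bytes that are not valid UTF-8.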

The complete solution may be:

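import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen("your url")  # Python 3 replacement for urllib2
content = html.read().decode('utf-8', 'ignore')  # bytes -> str
soup = BeautifulSoup(content, 'html.parser')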
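
Hard-coding 'utf-8' only works when the page really is UTF-8. As a hedged sketch of the idea in the related link from the comments, the decode step could instead use the charset declared in the HTTP response and fall back to UTF-8. The names `decode_body` and `fetch_text` are hypothetical helpers introduced here for illustration, and replacing undecodable bytes instead of dropping them is an assumption, not something the answer specifies.

```python
import urllib.request


def decode_body(raw, declared_charset=None, fallback="utf-8"):
    """Decode response bytes using the declared charset, else the fallback.

    decode_body is a hypothetical helper, not part of urllib or bs4.
    """
    charset = declared_charset or fallback
    try:
        # errors="replace" substitutes U+FFFD for undecodable bytes.
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The server declared a charset name Python does not recognise.
        return raw.decode(fallback, errors="replace")


def fetch_text(url):
    """Download url and return its body as a str."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        # Charset from the Content-Type header, or None if absent.
        declared = response.headers.get_content_charset()
    return decode_body(raw, declared)
```

The string returned by `fetch_text(url)` can then be passed to `BeautifulSoup(text, 'html.parser')` in place of `content` above.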

tmac_balla

@MagicMysteryBro please accept this answer if you find it helpful – tmac_balla Jan 06 '16 at 12:04
