I wrote a Wikipedia scraper in Python last week.
It scrapes French pages, so I must manage UTF-8 encoding to avoid errors. I did this with these lines at the beginning of my script:
#!/usr/bin/python
# -*- coding: utf-8 -*-
I also encode the scraped string like this:
adresse = monuments[1].get_text().encode('utf-8')
My first script worked perfectly fine with Python 2.7, but I rewrote it for Python 3 (especially to use urllib.request) and UTF-8 doesn't work anymore.
I got these errors after scraping the first few elements:
File "scraper_monu_historiques_ge_py3.py", line 19, in <module>
url = urllib.request.urlopen(url_ville).read() # et on ouvre chacune d'entre elles
File "/usr/lib/python3.4/urllib/request.py", line 153, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.4/urllib/request.py", line 455, in open
response = self._open(req, data)
File "/usr/lib/python3.4/urllib/request.py", line 473, in _open
'_open', req)
File "/usr/lib/python3.4/urllib/request.py", line 433, in _call_chain
result = func(*args)
File "/usr/lib/python3.4/urllib/request.py", line 1217, in https_open
context=self._context, check_hostname=self._check_hostname)
File "/usr/lib/python3.4/urllib/request.py", line 1174, in do_open
h.request(req.get_method(), req.selector, req.data, headers)
File "/usr/lib/python3.4/http/client.py", line 1090, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python3.4/http/client.py", line 1118, in _send_request
self.putrequest(method, url, **skips)
File "/usr/lib/python3.4/http/client.py", line 975, in putrequest
self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 58: ordinal not in range(128)
I don't understand why, because it worked fine in Python 2.7... I published a version of this WIP on Github.