I wrote a Wikipedia scraper in Python last week.

It scrapes French pages, so I must manage UTF-8 encoding to avoid errors. I did this with these lines at the beginning of my script:

#!/usr/bin/python
# -*- coding: utf-8 -*-

I also encode the scraped string like this:

adresse = monuments[1].get_text().encode('utf-8')

My first script worked perfectly with Python 2.7, but after I rewrote it for Python 3 (mainly to use urllib.request), the UTF-8 handling stopped working.

I got these errors after scraping the first few elements:

File "scraper_monu_historiques_ge_py3.py", line 19, in <module>
    url = urllib.request.urlopen(url_ville).read() # et on ouvre chacune d'entre elles
File "/usr/lib/python3.4/urllib/request.py", line 153, in urlopen
    return opener.open(url, data, timeout)
File "/usr/lib/python3.4/urllib/request.py", line 455, in open
    response = self._open(req, data)
File "/usr/lib/python3.4/urllib/request.py", line 473, in _open
    '_open', req)
File "/usr/lib/python3.4/urllib/request.py", line 433, in _call_chain
    result = func(*args)
File "/usr/lib/python3.4/urllib/request.py", line 1217, in https_open
    context=self._context, check_hostname=self._check_hostname)
File "/usr/lib/python3.4/urllib/request.py", line 1174, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
File "/usr/lib/python3.4/http/client.py", line 1090, in request
    self._send_request(method, url, body, headers)
File "/usr/lib/python3.4/http/client.py", line 1118, in _send_request
    self.putrequest(method, url, **skips)
File "/usr/lib/python3.4/http/client.py", line 975, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 58: ordinal not in range(128)

I don't understand why, because it worked fine in Python 2.7. I published a version of this work in progress on GitHub.

dkasak
Raphadasilva
  • The `# -*- coding: utf-8 -*-` directive has no effect on how your script processes Unicode data; it only tells the Python interpreter the encoding of the text in the script itself. You may find this article helpful: [Pragmatic Unicode](http://nedbatchelder.com/text/unipain.html), which was written by SO veteran Ned Batchelder. – PM 2Ring Oct 20 '16 at 19:08
  • This code `monuments[1].get_text().encode('utf-8')` converts Unicode to UTF-8. I doubt that is what you want at that point, but it is impossible to know, because there is no other code here. You also did not say on which line you get the error, nor what the values of the variables are at that point. – zvone Oct 20 '16 at 19:12
  • My bad! The error is in line 19: `url = urllib.request.urlopen(url_ville).read()` I think it's related to the liste depart, but it worked fine in Python 2.7... – Raphadasilva Oct 20 '16 at 19:19
  • Python 2 and 3 handle strings and their encoding very differently, so it is entirely possible that it fails on one but passes on the other. – Teemu Risikko Oct 20 '16 at 19:21
  • `url_ville` (or just its `ville` part) needs to be encoded as per the accepted answer in http://stackoverflow.com/questions/4389572/how-to-fetch-a-non-ascii-url-with-python-urlopen. – wrwrwr Oct 20 '16 at 20:02

1 Answer

You are passing a string that contains non-ASCII characters to urllib.request.urlopen. Such a string isn't a valid URI (it is a valid IRI, or Internationalized Resource Identifier, though).

You need to make the IRI a valid URI before passing it to urlopen. The specifics of this depend on which part of the IRI contains non-ASCII characters: the domain part should be encoded using Punycode, while the path should use percent-encoding.
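To make the difference between the two encodings concrete, here is a minimal sketch using only the standard library. The domain `café.example` and the path `/wiki/Genève` are made-up values for illustration; Python's built-in `idna` codec applies Punycode to each domain label, and `urllib.parse.quote` percent-encodes the UTF-8 bytes of the path.

```python
from urllib.parse import quote

# Hypothetical values, chosen only to illustrate the two encodings.
host = "café.example"
path = "/wiki/Genève"

# Domain part: the built-in "idna" codec applies Punycode label by label.
ascii_host = host.encode("idna").decode("ascii")

# Path part: percent-encode the UTF-8 bytes; "/" is left alone by default.
ascii_path = quote(path)

print(ascii_host)  # xn--caf-dma.example
print(ascii_path)  # /wiki/Gen%C3%A8ve
```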

Since your problem is exclusively due to the path containing non-ASCII characters, assuming your IRI is stored in the variable iri, you can fix it using the following:

import urllib.parse
import urllib.request

split_url = list(urllib.parse.urlsplit(iri))
split_url[2] = urllib.parse.quote(split_url[2])    # the third component is the path of the URL/IRI
url = urllib.parse.urlunsplit(split_url)

urllib.request.urlopen(url).read()
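If the scraper builds many URLs, the same idea can be wrapped in a small function. The following is only a sketch under some assumptions: `iri_to_uri` is a hypothetical helper name (not part of urllib), it also handles the domain and the query string, it ignores any `user:password@` part, and it treats `%` as safe so that already-encoded URLs are not encoded twice.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri):
    """Hypothetical helper: return an ASCII-only URI for the given IRI."""
    parts = urlsplit(iri)

    # Domain: Punycode via the built-in "idna" codec (userinfo is ignored).
    netloc = ""
    if parts.hostname:
        netloc = parts.hostname.encode("idna").decode("ascii")
    if parts.port:
        netloc += ":%d" % parts.port

    # Path, query and fragment: percent-encoding. "%" is kept as safe so
    # that calling the function on an already-encoded URL changes nothing.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("https://fr.wikipedia.org/wiki/Genève"))
# https://fr.wikipedia.org/wiki/Gen%C3%A8ve
```

The result can then be passed to `urllib.request.urlopen` as in the snippet above.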

However, if you can avoid urllib and have the option of using the requests library instead, I would recommend doing so. The library is easier to use and has automatic IRI handling.

dkasak