I'm trying to extract a dictionary entry:

import urllib.request
from urllib.parse import urlparse, parse_qs, urlencode
import lxml.html

url = 'http://www.lingvo.ua/uk/Interpret/uk-ru/вікно'
# parsed_url = urlparse(url)
# parameters = parse_qs(parsed_url.query)
# url = parsed_url._replace(query=urlencode(parameters, doseq=True)).geturl()
page = urllib.request.urlopen(url)
pageWritten = page.read()
pageReady = pageWritten.decode('utf-8')
xmldata = lxml.html.document_fromstring(pageReady)
text = xmldata.xpath('//div[@class="js-article-html g-card"]')

Whether the commented lines are enabled or not, I keep getting this error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 24-28: ordinal not in range(128)
ivan_pozdeev
Khrystyna
  • I doubt that is coming from the commented lines: it's almost certainly coming from the `decode('utf-8')` call, as would be clear if you'd posted the traceback. Why do you need that line? What happens if you remove it? – Daniel Roseman May 15 '15 at 13:51
  • @DanielRoseman nothing changes, the same error. I had the same problem here http://stackoverflow.com/questions/29435893/python-3-4-0-ascii-codec-cant-encode-characters-in-position-11-15-ordinal, but now I'm using a different URL with no parameters (that's why I commented out those lines). I still don't know the answer – Khrystyna May 15 '15 at 14:42
  • @MartinPieters may I ask for your help? You've already helped once in here http://stackoverflow.com/questions/29435893/python-3-4-0-ascii-codec-cant-encode-characters-in-position-11-15-ordinal – Khrystyna May 15 '15 at 14:52
  • @cpburnz didn't work out( – Khrystyna May 15 '15 at 14:55
  • You should make a habit of including the full traceback in your questions. Until now I thought the issue was with your `.decode`. The problem is the URL... – That1Guy May 15 '15 at 15:14

1 Answer

Your issue is that the URL path contains non-ASCII characters, which must be percent-encoded with urllib.parse.quote(string) in Python 3 or urllib.quote(string) in Python 2 before the URL is passed to urlopen.

# Python 3
import urllib.parse
url = 'http://www.lingvo.ua' + urllib.parse.quote('/uk/Interpret/uk-ru/вікно')

# Python 2
import urllib
url = 'http://www.lingvo.ua' + urllib.quote(u'/uk/Interpret/uk-ru/вікно'.encode('UTF-8'))

NOTE: According to "What is the proper way to URL encode Unicode characters?", URLs should be encoded as UTF-8. However, that does not remove the need to percent-encode the resulting non-ASCII UTF-8 bytes.
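
To make the note concrete, here is a small Python 3 sketch of what quote does to the path from the question: each non-ASCII character is first encoded as UTF-8 (the default for str input), and each resulting byte is then written as a %XX escape. The variable names are only illustrative.

```python
import urllib.parse

path = '/uk/Interpret/uk-ru/вікно'

# quote() leaves '/' alone by default (safe='/') and percent-encodes
# the UTF-8 bytes of every non-ASCII character.
encoded = urllib.parse.quote(path)
print(encoded)  # /uk/Interpret/uk-ru/%D0%B2%D1%96%D0%BA%D0%BD%D0%BE

# unquote() reverses the process, decoding the bytes as UTF-8 again.
decoded = urllib.parse.unquote(encoded)
print(decoded == path)  # True
```

The encoded string contains only ASCII, so it can be appended to the host part and passed to urllib.request.urlopen without triggering the UnicodeEncodeError.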

Community
Uyghur Lives Matter
  • It's better to use [`urlparse.urlsplit`](https://docs.python.org/2/library/urlparse.html#urlparse.urlsplit) and [`urlparse.urlunsplit`](https://docs.python.org/2/library/urlparse.html#urlparse.urlunsplit) so that only the specific part is processed. That also makes it possible to handle things like IDNs (`.encode("idna")`). – ivan_pozdeev Aug 30 '16 at 21:24
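
A minimal Python 3 sketch of the approach suggested in the comment above, where `urlsplit` and `urlunsplit` live in `urllib.parse` rather than `urlparse`. The helper name `encode_url` is hypothetical, and the sketch rests on some assumptions: the URL has a host, any userinfo (`user:password@`) can be dropped, and existing `%XX` escapes should be left as they are rather than double-encoded.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url(url):
    """Return an ASCII-only version of url (hypothetical helper).

    Each component is handled separately: the host is IDNA-encoded,
    while the path and query are percent-encoded as UTF-8.
    """
    parts = urlsplit(url)

    # parts.hostname excludes the port and userinfo; the idna codec
    # turns a non-ASCII host name into its ASCII "xn--" form.
    host = parts.hostname.encode('idna').decode('ascii')
    if parts.port is not None:
        host += ':%d' % parts.port

    # Keep '%' safe so already-escaped sequences are not escaped twice.
    path = quote(parts.path, safe='/%')
    query = quote(parts.query, safe='=&%')

    return urlunsplit((parts.scheme, host, path, query, parts.fragment))


print(encode_url('http://www.lingvo.ua/uk/Interpret/uk-ru/вікно'))
```

For the URL in the question this should give the same result as quoting the path by hand, and the returned string can be passed straight to urllib.request.urlopen.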