
When I try to scrape a Wikipedia page that has a special character in its URL, using urllib.request and Python, I get the following error: `UnicodeEncodeError: 'ascii' codec can't encode character '\xf8' in position 23: ordinal not in range(128)`

The code:

# -*- coding: utf-8 -*-
import urllib.request as ur

url = "https://no.wikipedia.org/wiki/Jonas_Gahr_Støre"
r = ur.urlopen(url).read()

How can I use urllib.request with utf-8 encoding?

Łukasz Rogalski
bjornasm

4 Answers


Apparently, urllib can only handle ASCII requests, and converting your URL to ASCII gives an error on your special character. Replacing ø with %C3%B8, the proper way to percent-encode this character in HTTP, seems to do the trick. However, I can't find a method that does this automatically, the way your browser does.

example:

>>> f="https://no.wikipedia.org/wiki/Jonas_Gahr_St%C3%B8re"
>>> import urllib.request
>>> g=urllib.request.urlopen(f)
>>> text=g.read()
>>> text[:100]
b'<!DOCTYPE html>\n<html class="client-nojs" lang="nb" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title'

The answer above doesn't work because it encodes the content after the request has already been processed, while the error here occurs during the request itself.

mousetail
  • Thank you. I could write a manual converter for the characters I need. – bjornasm Aug 30 '16 at 14:15
  • There _is_ a function to quote a URL, as the question this one is a duplicate of shows. – ivan_pozdeev Aug 30 '16 at 15:47
  • yes, but urlquote will escape out the slashes in the url as well as the special characters – mousetail Aug 30 '16 at 17:09
  • It won't escape slashes by default (but will escape the colon). There are [`urlparse.urlsplit`](https://docs.python.org/2/library/urlparse.html#urlparse.urlsplit) and [`urlparse.urlunsplit`](https://docs.python.org/2/library/urlparse.html#urlparse.urlunsplit) to only process the specific part. – ivan_pozdeev Aug 30 '16 at 21:20
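
For reference, a minimal sketch of the "automatic" approach the comments above point at (not part of the original answer; it assumes Python 3's urllib.parse): split the URL, percent-encode only the path, and reassemble it before passing it to urlopen.

from urllib.parse import urlsplit, urlunsplit, quote
import urllib.request as ur

def encode_url(url):
    parts = urlsplit(url)
    # quote() turns non-ASCII characters into UTF-8 percent-escapes;
    # safe="/" leaves the path separators untouched.
    return urlunsplit(parts._replace(path=quote(parts.path, safe="/")))

url = encode_url("https://no.wikipedia.org/wiki/Jonas_Gahr_Støre")
# -> https://no.wikipedia.org/wiki/Jonas_Gahr_St%C3%B8re
text = ur.urlopen(url).read()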

If using a library is an option, I would suggest the awesome requests:

# -*- coding: utf-8 -*-
import requests
r = requests.get('https://no.wikipedia.org/wiki/Jonas_Gahr_Støre')
print(r.text)
Yohan D

New plan - Using requests

from bs4 import BeautifulSoup
import requests

def scrape():
    url = "http://no.wikipedia.org/wiki/Jonas_Gahr_Støre"
    r = requests.get(url).content
    # parse the raw bytes, then re-encode the parsed document as UTF-8 bytes
    soup = BeautifulSoup(r, "html.parser").encode('utf-8')

    print(soup)

    print(r)

if __name__ == '__main__':
    scrape()
Daniel Lee

Using the answer from @mousetail, I wrote a custom encoder for the characters I needed:

def properEncode(url):
    # Percent-encode the Norwegian characters (the UTF-8 bytes of each one).
    url = url.replace("ø", "%C3%B8")
    url = url.replace("å", "%C3%A5")
    url = url.replace("æ", "%C3%A6")
    url = url.replace("Ø", "%C3%98")
    url = url.replace("Å", "%C3%85")
    url = url.replace("Æ", "%C3%86")
    return url
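
A short usage sketch (an assumption of how this plugs into the code from the question, not part of the original answer):

import urllib.request as ur

url = properEncode("https://no.wikipedia.org/wiki/Jonas_Gahr_Støre")
r = ur.urlopen(url).read()  # the URL is now ASCII-only, so no UnicodeEncodeError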
bjornasm