
When I try to scrape a Wikipedia page that has a special character in its URL, using urllib.request and Python, I get the following error: `UnicodeEncodeError: 'ascii' codec can't encode character '\xf8' in position 23: ordinal not in range(128)`

The code:

# -*- coding: utf-8 -*-
import urllib.request as ur

url = "https://no.wikipedia.org/wiki/Jonas_Gahr_Støre"
r = ur.urlopen(url).read()

How can I use urllib.request with utf-8 encoding?

Łukasz Rogalski
bjornasm

4 Answers


Apparently, urllib can only handle ASCII requests, and converting your URL to ASCII gives an error on your special character. Replacing ø with %C3%B8, the proper way to percent-encode this character in HTTP, seems to do the trick. However, I can't find a method that does this automatically, the way your browser does.

example:

>>> f="https://no.wikipedia.org/wiki/Jonas_Gahr_St%C3%B8re"
>>> import urllib.request
>>> g=urllib.request.urlopen(f)
>>> text=g.read()
>>> text[:100]
b'<!DOCTYPE html>\n<html class="client-nojs" lang="nb" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title'

The answer above doesn't work because it encodes the content after the request has already been processed, while the error here occurs during the request itself.

mousetail
  • Thank you. I could write a manual converter for the characters I need. – bjornasm Aug 30 '16 at 14:15
  • There _is_ a function to quote a URL, as the question this one is a duplicate of shows. – ivan_pozdeev Aug 30 '16 at 15:47
  • yes, but urlquote will escape out the slashes in the url as well as the special characters – mousetail Aug 30 '16 at 17:09
  • It won't escape slashes by default (but will escape the colon). There are [`urlparse.urlsplit`](https://docs.python.org/2/library/urlparse.html#urlparse.urlsplit) and [`urlparse.urlunsplit`](https://docs.python.org/2/library/urlparse.html#urlparse.urlunsplit) to only process the specific part. – ivan_pozdeev Aug 30 '16 at 21:20
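
For reference, a minimal sketch of the "automatic" approach the comments above point at (not part of the original answer; it assumes Python 3's urllib.parse): split the URL, percent-encode only the path, and reassemble it before passing it to urlopen.

from urllib.parse import urlsplit, urlunsplit, quote
import urllib.request as ur

def encode_url(url):
    parts = urlsplit(url)
    # quote() turns non-ASCII characters into UTF-8 percent-escapes;
    # safe="/" leaves the path separators untouched.
    return urlunsplit(parts._replace(path=quote(parts.path, safe="/")))

url = encode_url("https://no.wikipedia.org/wiki/Jonas_Gahr_Støre")
# -> https://no.wikipedia.org/wiki/Jonas_Gahr_St%C3%B8re
text = ur.urlopen(url).read()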

If using a library is an option, I would suggest the awesome requests:

# -*- coding: utf-8 -*-
import requests
r = requests.get('https://no.wikipedia.org/wiki/Jonas_Gahr_Støre')
print(r.text)
Yohan D

New plan - Using requests

from bs4 import BeautifulSoup
import requests

def scrape():
    url = "http://no.wikipedia.org/wiki/Jonas_Gahr_Støre"
    r = requests.get(url).content
    # parse the raw bytes, then re-encode the parsed document as UTF-8 bytes
    soup = BeautifulSoup(r, "html.parser").encode('utf-8')

    print(soup)

    print(r)

if __name__ == '__main__':
    scrape()
Daniel Lee

Using the answer from @mousetail, I wrote a custom encoder for the characters I needed:

def properEncode(url):
    # Percent-encode the Norwegian characters (the UTF-8 bytes of each one).
    url = url.replace("ø", "%C3%B8")
    url = url.replace("å", "%C3%A5")
    url = url.replace("æ", "%C3%A6")
    url = url.replace("Ø", "%C3%98")
    url = url.replace("Å", "%C3%85")
    url = url.replace("Æ", "%C3%86")
    return url
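
A short usage sketch (an assumption of how this plugs into the code from the question, not part of the original answer):

import urllib.request as ur

url = properEncode("https://no.wikipedia.org/wiki/Jonas_Gahr_Støre")
r = ur.urlopen(url).read()  # the URL is now ASCII-only, so no UnicodeEncodeError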
bjornasm