The script below raises "UnicodeEncodeError: 'ascii' codec can't encode character '\xf8' in position 118: ordinal not in range(128)"

and I can't find a good explanation for it.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

results = {}

for page_num in range(0, 1000, 20):
    address = 'https://www.proff.no/nyetableringer?industryCode=p441&fromDate=22.01.2007&location=Nord-Norge&locationId=N&offset=' + str(page_num) + '&industry=Entreprenører' 

    html = urlopen(address)
    soup = BeautifulSoup(html, 'lxml')
    table = soup.find_all(class_='table-condensed')
    output = pd.read_html(str(table))[0]
    results[page_num] = output


df = pd.concat([v for v in results.values()], axis = 0)
Alan Kavanagh
NRVA

1 Answer


You are using the standard library's urllib.request to open the URL. It encodes the request line as ASCII before sending it, so a non-ASCII character such as ø (the '\xf8' in the message) raises a UnicodeEncodeError.

Lines 1116-1117 of http/client.py:

    # Non-ASCII characters should have been eliminated earlier
    self._output(request.encode('ascii'))
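The failure can be reproduced without any network access. This is only a rough sketch: the request line below is a shortened, made-up stand-in for what http/client.py would build, not the exact string it sends.

```python
# Hypothetical, shortened request line containing the same non-ASCII character.
request_line = 'GET /nyetableringer?industry=Entreprenører HTTP/1.1'

try:
    request_line.encode('ascii')  # same kind of call as in http/client.py
    error = None
except UnicodeEncodeError as exc:
    error = exc

# Show which character could not be encoded.
print(type(error).__name__, repr(error.object[error.start]))
```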

As an alternative to urllib.request, the third-party requests library works well.

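import requests

page_num = 0  # in the question's script this value comes from the for loop
address = 'https://www.proff.no/nyetableringer?industryCode=p441&fromDate=22.01.2007&location=Nord-Norge&locationId=N&offset=' + str(page_num) + '&industry=Entreprenører'
# requests accepts the non-ASCII characters in the URL
html = requests.get(address).text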
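If you would rather stay with urllib.request, another option is to percent-encode the address yourself before passing it to urlopen. The sketch below makes a few assumptions: build_address and BASE are names invented here for illustration, and it assumes the site accepts UTF-8 percent-encoding for the industry parameter.

```python
from urllib.parse import quote

# URL prefix taken from the question; BASE is a hypothetical helper name.
BASE = ('https://www.proff.no/nyetableringer?industryCode=p441'
        '&fromDate=22.01.2007&location=Nord-Norge&locationId=N&offset=')


def build_address(page_num):
    """Return an ASCII-only URL for the given offset (hypothetical helper)."""
    raw = BASE + str(page_num) + '&industry=Entreprenører'
    # Keep the URL delimiters as they are; non-ASCII characters are
    # percent-encoded as UTF-8 (assumed to be what the server expects).
    return quote(raw, safe=':/?&=')


address = build_address(0)
print(address)
```

The resulting string can be handed to urlopen(address) in the original loop, since it no longer contains anything outside ASCII.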
Adam Holloway