scrape with correct character encoding (python requests + beautifulsoup)

Question

I have an issue parsing this website: http://fm4-archiv.at/files.php?cat=106

It contains special characters such as umlauts. See here:

My Chrome browser displays the umlauts properly, as you can see in the screenshot above. However, on other pages (e.g. http://fm4-archiv.at/files.php?cat=105) the umlauts are not displayed properly, as can be seen in the screenshot below:

The meta HTML tag defines the following charset on the pages:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>

I use the Python requests package to get the HTML and then use BeautifulSoup to scrape the desired data. My code is as follows:

import requests
from bs4 import BeautifulSoup

r = requests.get(URL)
soup = BeautifulSoup(r.content, "lxml")

If I print the encoding (print(r.encoding)), the result is UTF-8. If I manually change the encoding to ISO-8859-1 or cp1252 by setting r.encoding = 'ISO-8859-1', nothing changes when I output the data on the console. This is my main issue.

r = requests.get(URL)
r.encoding = 'ISO-8859-1'
soup = BeautifulSoup(r.content,"lxml")

still results in the following string shown on the console output in my python IDE:

Der WildlÃ¶wenpfleger

instead it should be

Der Wildlöwenpfleger

How can I change my code to parse the umlauts properly?

score 6 · Accepted Answer · answered Sep 16 '17 at 12:58

In general, instead of using r.content, which is the byte string received, use r.text, which is the content decoded using the encoding determined by requests.

In this case requests will use UTF-8 to decode the incoming byte string because this is the encoding reported by the server in the Content-Type header:

>>> import requests
>>> from bs4 import BeautifulSoup

>>> r = requests.get('http://fm4-archiv.at/files.php?cat=106')

>>> type(r.content)    # raw content
<class 'bytes'>
>>> type(r.text)       # decoded to unicode
<class 'str'>
>>> r.headers['Content-Type']
'text/html; charset=UTF-8'
>>> r.encoding
'UTF-8'

>>> soup = BeautifulSoup(r.text, 'lxml')

That fixes the "Wildlöwenpfleger" problem. However, other parts of the page then begin to break, for example:

>>> soup = BeautifulSoup(r.text, 'lxml')     # using decoded string... should work
>>> soup.find_all('a')[39]
<a href="details.php?file=1882">Der Wildlöwenpfleger</a>
>>> soup.find_all('a')[10]
<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon �bergeben. Auf Streifz�gen durch die Popliteratur st��t Hermes auf deren gro�e Themen und h�rt mit euch quer. In der heutige">Salon Hermes (6 files)

shows that "Wildlöwenpfleger" is fixed but now "übergeben" and others in the second link are broken.

It appears that multiple encodings are used in the one HTML document. The first link uses UTF-8 encoding:

>>> r.content[8013:8070].decode('iso-8859-1')
'<a href="details.php?file=1882">Der WildlÃ¶wenpfleger</a>'

>>> r.content[8013:8070].decode('utf8')
'<a href="details.php?file=1882">Der Wildlöwenpfleger</a>'

but the second link uses ISO-8859-1 encoding:

>>> r.content[2868:3132].decode('iso-8859-1')
'<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon übergeben. Auf Streifzügen durch die Popliteratur stößt Hermes auf deren große Themen und hört mit euch quer. In der heutige">Salon Hermes (6 files)\r\n</a>'

>>> r.content[2868:3132].decode('utf8', 'replace')
'<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon �bergeben. Auf Streifz�gen durch die Popliteratur st��t Hermes auf deren gro�e Themen und h�rt mit euch quer. In der heutige">Salon Hermes (6 files)\r\n</a>'
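To locate such stretches without hunting for byte offsets by hand, one option is to let the UTF-8 decoder report where it fails. This is a minimal sketch that assumes the body is mostly UTF-8; invalid_utf8_spans is a hypothetical helper name, not part of requests or BeautifulSoup, and the sample bytes are built locally rather than fetched:

```python
def invalid_utf8_spans(raw: bytes):
    """Return (start, end) byte offsets of stretches that are not valid UTF-8."""
    spans = []
    pos = 0
    while pos < len(raw):
        try:
            raw[pos:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as exc:
            # exc.start / exc.end are relative to the slice, so shift by pos
            spans.append((pos + exc.start, pos + exc.end))
            pos += exc.end
    return spans


# Local stand-in for r.content: a UTF-8 part followed by an ISO-8859-1 part
sample = "Der Wildlöwenpfleger ".encode("utf-8") + "übergeben".encode("iso-8859-1")
print(invalid_utf8_spans(sample))
```

With a real response, the same function could be called as invalid_utf8_spans(r.content).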

Obviously it is incorrect to use multiple encodings in the same HTML document. Other than contacting the document's author and asking for a correction, there is not much that you can easily do to handle the mixed encoding. Perhaps you can run chardet.detect() over the data as you process it, but it's not going to be pleasant.
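If the document is mostly UTF-8 with ISO-8859-1 fragments mixed in, as appears to be the case here, one possible workaround is a custom decode error handler that falls back to ISO-8859-1 only for the bytes UTF-8 rejects. This is a sketch under that assumption, not a general solution: an ISO-8859-1 byte sequence that happens to also be valid UTF-8 would still be misread. The handler name latin1_fallback and the function decode_mixed are made up for illustration.

```python
import codecs


def _latin1_fallback(exc):
    """Decode the bytes that UTF-8 rejected as ISO-8859-1 and resume after them."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad_bytes = exc.object[exc.start:exc.end]
    return bad_bytes.decode("iso-8859-1"), exc.end


# Register the handler under an illustrative (hypothetical) name
codecs.register_error("latin1_fallback", _latin1_fallback)


def decode_mixed(raw: bytes) -> str:
    """Decode as UTF-8, falling back to ISO-8859-1 for invalid byte runs."""
    return raw.decode("utf-8", errors="latin1_fallback")


# Local stand-in for r.content with both encodings in one byte string
sample = "Der Wildlöwenpfleger ".encode("utf-8") + "stößt große".encode("iso-8859-1")
print(decode_mixed(sample))
```

The decoded string could then be handed to the parser, e.g. BeautifulSoup(decode_mixed(r.content), 'lxml').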

beta · Answer 2 · 2017-09-16T11:34:15.563

I just found two solutions. Can you confirm?

soup = BeautifulSoup(r.content.decode('utf-8', 'ignore'), "lxml")

and

soup = BeautifulSoup(r.content, "lxml", from_encoding='utf-8')

Both result in the following example output:

Der Wildlöwenpfleger

EDIT: I just wonder why these work, because r.encoding is UTF-8 anyway. That tells me requests already handled the data as UTF-8. So why do .decode('utf-8', 'ignore') or from_encoding='utf-8' produce the desired output?

EDIT 2: Okay, I think I get it now. Both .decode('utf-8', 'ignore') and from_encoding='utf-8' state that the actual data is encoded as UTF-8, so BeautifulSoup parses it as UTF-8 encoded data, which is actually the case.

I assume that requests correctly handled it as UTF-8, but BeautifulSoup did not. Hence, I have to do this extra decoding.
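One caveat about the 'ignore' variant, shown as a small offline sketch: the 'ignore' error handler silently drops every byte that is not valid UTF-8, so on a page that also carries ISO-8859-1 bytes the affected characters vanish rather than being repaired. The sample words are taken from the page discussed above; the variable names are only illustrative.

```python
# Bytes as they would appear in an ISO-8859-1 encoded part of the page
latin1_bytes = "Salon übergeben, stößt".encode("iso-8859-1")

# 'ignore' drops the undecodable bytes, so the umlauts disappear entirely
ignored = latin1_bytes.decode("utf-8", "ignore")

# 'replace' keeps a visible U+FFFD marker for each undecodable byte
replaced = latin1_bytes.decode("utf-8", "replace")

print(ignored)
print(replaced)
```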

soup = BeautifulSoup(response.content, 'html.parser') works for me. - giasuddin, Dec 09 '19 at 12:13
@giasuddin It also worked for me, but some UTF-8 sites broke, so I added a condition: if the response is UTF-8, use .text, otherwise use .content. - luky, Nov 08 '20 at 10:55
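One way to read the condition described in the last comment, as a sketch only: the helper name markup_for_soup is hypothetical, and treating only a reported UTF-8 encoding as trustworthy is an assumption about what the commenter meant. The SimpleNamespace object merely stands in for a requests response so the snippet runs offline.

```python
from types import SimpleNamespace


def markup_for_soup(response):
    """Return decoded text for UTF-8 responses, raw bytes otherwise."""
    encoding = (response.encoding or "").lower()
    if encoding in ("utf-8", "utf8"):
        return response.text      # trust the UTF-8 decoding done by requests
    return response.content       # let BeautifulSoup detect the encoding itself


# Stand-in for a requests.Response with the three attributes used above
fake = SimpleNamespace(
    encoding="UTF-8",
    text="Der Wildlöwenpfleger",
    content="Der Wildlöwenpfleger".encode("utf-8"),
)
print(markup_for_soup(fake))
```

The return value would then be passed to the parser, e.g. BeautifulSoup(markup_for_soup(r), 'html.parser').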
