BeautifulSoup returns text with UTF-8 errors although it looks fine on the original webpage

Question

When I inspect the page in Chrome DevTools the text looks fine, but once scraped it contains character errors.

For example, in the code below the h1 should return "Valerian e la città dei mille pianeti" and not "Valerian e la cittÃ dei mille pianeti".

These character errors recur in any text scraped from this domain.

I don't understand why, since this code works perfectly on other websites.

#  -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('http://www.mymovies.it/film/2017/valerian/', headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

title = soup.find('h1').get_text()

print(title)

Are you sure the website is using UTF-8? That looks like latin-1 mojibake to me. Or vice versa: are you sure your output streams are utf-8? — Max, Sep 20 '17 at 19:57
In the Header a meta is , but there is maybe a character conflict somewhere. I don't know :( — Awco, Sep 20 '17 at 20:10
Possibly related: https://github.com/requests/requests/issues/1604 — unutbu, Sep 20 '17 at 20:21
I tried on both computers I usually use to run my scripts (both set to utf-8), with the same result. I only have the problem with this webpage in particular. — Awco, Sep 20 '17 at 20:23
Some workarounds are discussed in [this github issue](https://github.com/requests/requests/issues/1604), and also [in this SO question](https://stackoverflow.com/q/36453359/190597). Note also that the behavior of `requests` will [change in version 3.0.0](https://github.com/requests/requests/issues/2086). — unutbu, Sep 20 '17 at 20:50

score 1 · Answer 1 · answered Sep 20 '17 at 20:58

Solved!

I followed the link from @unutbu and forced the response encoding to utf-8, even though it is already defined in the header.

response.encoding = 'utf-8'
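
To see why this one line helps, here is a minimal offline sketch of the mismatch. It assumes (this is not confirmed for this site) that the server sends a text Content-Type without a charset, in which case `requests` falls back to ISO-8859-1 when building `response.text`. The variable names are only illustrative, and the byte string stands in for `response.content`.

```python
# Stand-in for response.content: the page bytes as the server sends them (UTF-8).
raw = "Valerian e la città dei mille pianeti".encode("utf-8")

# Decoding UTF-8 bytes as ISO-8859-1 turns each multi-byte character
# into two wrong characters. This mimics response.text with the fallback encoding.
wrong = raw.decode("iso-8859-1")

# Decoding with the right codec mimics response.text after
# setting response.encoding = 'utf-8'.
right = raw.decode("utf-8")

# The damage is reversible as long as nothing was lost in between.
repaired = wrong.encode("iso-8859-1").decode("utf-8")

print(wrong)
print(right)
print(repaired == right)
```

Two alternatives that should have the same effect: set `response.encoding = response.apparent_encoding` to let `requests` guess from the body, or pass `response.content` (bytes) instead of `response.text` to `BeautifulSoup`, so the parser can detect the encoding itself, for instance from the page's meta tag.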

Awco
