When I inspect the page in Chrome DevTools the text looks fine, but once scraped it contains character errors.

For example, in the code below the h1 should return "Valerian e la città dei mille pianeti" and not "Valerian e la cittÃ dei mille pianeti".

The same character errors appear whenever I scrape any text from this domain.

I don't understand why, since this code works perfectly on other websites.

#  -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('http://www.mymovies.it/film/2017/valerian/', headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

title = soup.find('h1').get_text()

print(title)
Jean-François Corbett
Awco
  • Are you sure the website is using UTF-8? That looks like latin-1 mojibake to me. Or vice versa: are you sure your output streams are utf-8? – Max Sep 20 '17 at 19:57
  • In the Header a meta is , but there is maybe a character conflict somewhere. I don't know :( – Awco Sep 20 '17 at 20:10
  • Possibly related: https://github.com/requests/requests/issues/1604 – unutbu Sep 20 '17 at 20:21
  • I tried on both computers I usually use to run my scripts (both on utf-8), with the same result. I only have the problem with this webpage in particular. – Awco Sep 20 '17 at 20:23
  • Some workarounds are discussed in [this github issue](https://github.com/requests/requests/issues/1604), and also [in this SO question](https://stackoverflow.com/q/36453359/190597). Note also that the behavior of `requests` will [change in version 3.0.0](https://github.com/requests/requests/issues/2086). – unutbu Sep 20 '17 at 20:50
  • Thank you, @unutbu – Awco Sep 20 '17 at 21:01
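
The latin-1 mojibake suspected in the comments can be reproduced without any network access. The sketch below assumes the server sends UTF-8 bytes and that `requests` decodes them as ISO-8859-1, which is its documented fallback for `text/*` responses whose `Content-Type` header carries no charset. A `<meta>` tag inside the HTML is not consulted. The variable names are illustrative only.

```python
# The title as the page presumably serves it: UTF-8 encoded bytes.
title = "Valerian e la città dei mille pianeti"
raw = title.encode("utf-8")

# Decoding those bytes as ISO-8859-1 (Latin-1) splits "à" (0xC3 0xA0)
# into two separate characters: "Ã" and a non-breaking space.
garbled = raw.decode("iso-8859-1")

# Decoding with the matching codec gives the original text back.
correct = raw.decode("utf-8")

# Text that is already garbled can be repaired by reversing the wrong step.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(correct == title, repaired == title)
```

If this assumption holds for the page in question, the fix is to make the decoding step use UTF-8 rather than to change anything in BeautifulSoup.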

1 Answer

Solved!

I followed the link from @unutbu and forced the response encoding to UTF-8, even though UTF-8 is already declared in the page's header:

response.encoding = 'utf-8'
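
For context, here is a minimal sketch of where that line might sit in the script from the question. The function name `fetch_title`, its `encoding` parameter, and the `timeout` value are hypothetical additions for illustration. The `requests` and BeautifulSoup calls are the ones already used in the question.

```python
def fetch_title(url, encoding="utf-8"):
    """Return the text of the first <h1> on the page (hypothetical helper)."""
    # Third-party imports are kept inside the function so that defining it
    # does not require requests or beautifulsoup4 to be installed.
    import requests
    from bs4 import BeautifulSoup

    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()

    # Override the encoding requests inferred from the HTTP headers,
    # so that response.text decodes the body with the given codec.
    response.encoding = encoding

    soup = BeautifulSoup(response.text, "html.parser")
    h1 = soup.find("h1")
    return h1.get_text() if h1 is not None else None
```

Calling `fetch_title('http://www.mymovies.it/film/2017/valerian/')` would then be expected to return the title with "à" intact, assuming the page is still served as UTF-8. Another option is to pass `response.content` (the raw bytes) to BeautifulSoup instead of `response.text`, which lets BeautifulSoup do its own encoding detection.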
Awco