
I am having an encoding issue: when I make the exact same request from my spider and from the scrapy shell, the responses I get do not declare the same encoding.

I.e. when scraping using my spider:

def parse(self, response):
    print(response.headers[b'Content-Type'])

b'text/html; charset=utf-8'

Whereas when using the scrapy shell:

scrapy shell https://www.agoravox.fr/tribune-libre/article/attentat-contre-charlie-hebdo-161711
>>> response.headers[b'Content-Type']

b'text/html; charset=iso-8859-1'

This is a problem because the page is actually encoded in iso-8859-1, so I get Unicode replacement characters when scraping from my spider. Any ideas?

Thank you

taco
  • Try specifying a browser type (user agent) in your headers, as in https://stackoverflow.com/questions/54699365/adding-headers-to-scrapy-spider it may change the results – B. Go Nov 16 '19 at 22:21
  • @B.Go it didn't work, the response header is still in utf8 and the replacement characters are still present – taco Nov 17 '19 at 01:02
  • it was worth trying. With several headers from different browsers... https://github.com/scrapy/scrapy/issues/2154 may also help? Or https://stackoverflow.com/questions/1495627/how-to-download-any-webpage-with-correct-charset-in-python or maybe you could convert the page / lie about its encoding... – B. Go Nov 17 '19 at 16:43
  • https://docs.scrapy.org/en/latest/topics/request-response.html says also that the request encoding is outside of the header – B. Go Nov 17 '19 at 16:45

1 Answer


Regardless of why you get a different response header in each scenario, if the response consistently uses an encoding (ISO-8859-1) that does not always match the Content-Type response header, read the response body as bytes from response.body and decode it with .decode('iso-8859-1').
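As a minimal sketch of that suggestion, assuming the body really is ISO-8859-1: the helper name `decode_latin1` and the sample bytes are hypothetical and stand in for `response.body`, so the snippet runs without Scrapy.

```python
def decode_latin1(body: bytes) -> str:
    """Decode raw response bytes as ISO-8859-1, ignoring the declared charset."""
    return body.decode("iso-8859-1")


# Stand-in for response.body: "café" encoded as ISO-8859-1,
# where "é" is the single byte 0xE9.
sample_body = "café".encode("iso-8859-1")
text = decode_latin1(sample_body)
print(text)

# Inside a spider callback the same call would look like:
#     text = decode_latin1(response.body)
```

ISO-8859-1 maps every byte to a character, so this decode never raises. If the output still looks wrong, the bytes were probably not ISO-8859-1 in the first place.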

Gallaecio
  • Well this is the main issue I'm having here, I get replacement characters when doing this. When using decode('ISO-8859-1'), I get a bunch of �d, when using decode('utf-8'), I get some � – taco Nov 21 '19 at 16:44
  • Where do you get those “replacement characters”? If you write `response.body` into a file in `wb` mode, and you open that file with a plain text editor, which encoding does the file use according to the plain text editor? – Gallaecio Nov 21 '19 at 16:57
  • I get this error in my terminal (the same in which I don't have issues while using scrapy shell). When writing the response.body directly to a file in wb mode, my text editor loads it in utf8 and warns me there are encoding problems. – taco Nov 22 '19 at 13:51
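
The check discussed in the comments can also be done on the raw bytes instead of in a text editor. The sketch below is only a diagnostic aid under stated assumptions: `inspect_body` is a hypothetical helper, not part of Scrapy, and the sample inputs stand in for `response.body`. It reports whether the bytes are valid UTF-8 and whether they already contain the UTF-8 encoding of U+FFFD, which would suggest the replacement characters were present in the bytes the spider received rather than introduced by decoding.

```python
# UTF-8 encoding of the Unicode replacement character U+FFFD.
REPLACEMENT_UTF8 = "\ufffd".encode("utf-8")  # b'\xef\xbf\xbd'


def inspect_body(body: bytes) -> dict:
    """Report basic encoding facts about raw response bytes."""
    try:
        body.decode("utf-8")  # strict decode: raises on invalid sequences
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    return {
        "valid_utf8": valid_utf8,
        "replacement_bytes": body.count(REPLACEMENT_UTF8),
    }


# Hypothetical stand-ins for response.body:
print(inspect_body("café".encode("iso-8859-1")))    # Latin-1 bytes
print(inspect_body("café".encode("utf-8")))         # clean UTF-8 bytes
print(inspect_body("caf\ufffd".encode("utf-8")))    # UTF-8 with U+FFFD baked in
```

A non-zero `replacement_bytes` count would mean that decoding with either 'utf-8' or 'iso-8859-1' cannot recover the original characters from that body.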