Scraping French site and getting the UnicodeEncodeError

Question

I am scraping a site and following links found on it, but many of the links contain accented French characters. I cannot get the links for these pages, so I cannot scrape them.

This is the part of the code that gets URLs from the start pages:

def parse(self, response):
    subURLs = []
    partialURLs = response.css('.directory_name::attr(href)').extract()

    for i in partialURLs:
        yield response.follow('https://wheelsonline.ca/' + str(i), self.parse_dealers)

And this is the error that I am getting in the log:

yield response.follow('https://wheelsonline.ca/' + str(i), self.parse_dealers)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 58: ordinal not in range(128)

Any help is appreciated! Thank you!

malberts · Accepted Answer · 2019-02-23T06:30:04.790


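Don't use str() to convert that value. The u'\xe9' in the traceback shows this is Python 2, where calling str() on a unicode string implicitly encodes it with the ASCII codec, and that fails on characters such as é. Read more about that here: UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)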
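The failure can be reproduced without Scrapy. The sketch below is only an illustration: the path is a made-up example, and implicit_ascii is a hypothetical helper that spells out the ASCII encoding step that Python 2's str() performs implicitly on a unicode string.

```python
# -*- coding: utf-8 -*-
# Hypothetical relative link containing an accented character (e with acute).
path = u"/concessionnaires/montr\xe9al"


def implicit_ascii(text):
    """Hypothetical helper: the explicit form of what Python 2's str()
    does to a unicode string, namely encoding it with the ASCII codec."""
    return text.encode("ascii")


try:
    implicit_ascii(path)
    failed = False
    message = ""
except UnicodeEncodeError as exc:
    # Same exception type as in the question's log.
    failed = True
    message = str(exc)

# Concatenating the text as-is, without str(), keeps the accent intact.
full_url = u"https://wheelsonline.ca" + path
```

Dropping the str() call is enough to avoid the exception, because the href values returned by the selector are already text.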

However, there is a better way to build URLs like that: Scrapy's built-in response.urljoin:

yield response.follow(response.urljoin(i), self.parse_dealers)

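This automatically builds the full URL from the current page's URL plus the relative path. response.follow also accepts relative URLs directly, so response.follow(i, self.parse_dealers) should work as well.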
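As a rough illustration of the joining behaviour, here is a sketch using the standard library's urljoin, on the assumption that response.urljoin resolves links the same way against the page URL. The base URL and href values are invented for the example and are not taken from the real site.

```python
# -*- coding: utf-8 -*-
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

# Hypothetical URL of the page being parsed.
base = u"https://wheelsonline.ca/dealers/"

# Hypothetical href values: one root-relative, one relative to the page.
hrefs = [u"/concessionnaires/montr\xe9al", u"qu\xe9bec/auto"]

# No str() call anywhere, so the accented characters are never ASCII-encoded.
absolute = [urljoin(base, href) for href in hrefs]
```

A root-relative href replaces the whole path, while a plain relative one is appended to the page's directory, which hand-written string concatenation does not handle.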

edited Feb 23 '19 at 06:30

answered Feb 23 '19 at 06:18


Thank you Albert, that was super helpful!! – Minjia Zhu Feb 25 '19 at 18:48
