
I am scraping a site and following links found on its pages. However, many of the links contain accented French characters, so I am unable to request those pages and scrape them.

This is the part of the code that gets the URLs from the start pages:

def parse(self, response):
    subURLs = []

    partialURLs = response.css('.directory_name::attr(href)').extract()

    for i in partialURLs:
        yield response.follow('https://wheelsonline.ca/' + str(i), self.parse_dealers)

And this is the error I am getting in the log:

yield response.follow('https://wheelsonline.ca/' + str(i), self.parse_dealers)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 58: ordinal not in range(128)

Any help is appreciated! Thank you!

1 Answer

Don't use str() to convert that value. Read more about that here: UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
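To illustrate why str() trips on these links, here is a minimal sketch. It is written for Python 3, where the ASCII encoding step has to be spelled out; in Python 2, calling str() on a unicode value performs that encoding implicitly. The href value and the helper name ascii_encode_error are hypothetical, not taken from the site or from Scrapy.

```python
# Hypothetical href containing an accented character (not from the real site).
href = u'/dealers/montr\xe9al'


def ascii_encode_error(text):
    """Return the UnicodeEncodeError raised by ASCII-encoding text, or None."""
    try:
        text.encode('ascii')
    except UnicodeEncodeError as exc:
        return exc
    return None


err = ascii_encode_error(href)
print(err.encoding)  # codec that rejected the character
print(err.reason)    # why it was rejected
print(err.start)     # index of the first character that could not be encoded
```

A string with only ASCII characters makes the helper return None, which is why the same spider works on links without accents.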

However, there is a better way to create URLs like that using Scrapy's built-in urljoin:

yield response.follow(response.urljoin(i), self.parse_dealers)

This will automatically create the full URL based on the current URL plus the relative path.
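As a rough illustration of that resolution step outside Scrapy, the sketch below uses the standard library's urllib.parse.urljoin (Python 3), which should behave in much the same way for a simple case like this. The page URL and href are hypothetical, and the quote() call is only there to show what the percent-encoded form of such a path looks like.

```python
from urllib.parse import quote, urljoin

# Hypothetical values; neither is taken from the real site.
page_url = 'https://wheelsonline.ca/directory/index.html'
href = '/dealers/montréal'

# Resolve the href against the URL of the page it was found on.
full_url = urljoin(page_url, href)
print(full_url)

# Percent-encode the non-ASCII characters of the path as UTF-8.
encoded_path = quote(href)
print(encoded_path)
```

Scrapy's documentation also describes response.follow as accepting relative URLs, so passing the extracted href to it directly may be enough; that is worth checking against the Scrapy version in use.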

malberts