Background:
I am planning to buy a car and want to monitor prices.
I'd like to use Scrapy to do this for me.
However, the site blocks my code from doing so.
MWE/Code:
#!/usr/bin/python3
# from bs4 import BeautifulSoup
import scrapy  # adding scrapy to our file
from scrapy.crawler import CrawlerProcess  # lets us run the spider from this script


class HeadphoneSpider(scrapy.Spider):  # our class inherits from scrapy.Spider
    name = "headphones"

    def start_requests(self):
        urls = ['https://www.carsales.com.au/cars/volkswagen/golf/7-series/wagon-bodystyle/diesel-fueltype/']  # list to enter our urls
        # urls = ['https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=headphones&rh=i%3Aaps%2Ck%3Aheadphones&ajr=2']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)  # we will explain the callback soon

    def parse(self, response):
        # Collect the src attribute of every <img> tag and write one URL per line
        img_urls = response.css('img::attr(src)').extract()
        with open('urls.txt', 'w') as f:
            for u in img_urls:
                f.write(u + "\n")


def main():
    # Start a crawler in this process and block until the spider finishes
    process = CrawlerProcess()
    process.crawl(HeadphoneSpider)
    process.start()


if __name__ == "__main__":
    main()
Output:
...some stuff above it
2020-01-10 00:37:59 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.carsales.com.au/cars/volkswagen/golf/7-series/wagon-bodystyle/diesel-fueltype/>: HTTP status code is not handled or not allowed
..some more stuff underneath
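For context, as far as I can tell this log line comes from Scrapy's HttpErrorMiddleware, which by default drops non-2xx responses before they reach the callback, so parse is never called. Below is a minimal sketch of a setting that should let the 403 through so its body and headers can be inspected. The name DEBUG_SETTINGS is hypothetical; the dict is meant to be assigned to the spider's custom_settings attribute.

```python
# Hypothetical name; assign this dict to `custom_settings` on the spider class.
DEBUG_SETTINGS = {
    # Scrapy setting: responses with these status codes are passed to the
    # callback instead of being dropped by HttpErrorMiddleware.
    "HTTPERROR_ALLOWED_CODES": [403],
}
```

With this in place, response.status and response.text inside parse should show what the site actually sends back with the 403. It does not remove the block itself.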
Question:
I don't know how to get around this "not allowed" response
so that I can parse the prices, kilometres, etc. It would make my life so much easier. How can I get past this block? FWIW, I also tried BeautifulSoup, which didn't work either.
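One thing that might be worth trying is sketched below, assuming the 403 is triggered by Scrapy's default User-Agent. That is only a guess; the site may use other bot detection that headers alone will not get past. The setting keys are standard Scrapy settings, while the name BROWSER_LIKE_SETTINGS and the header values are illustrative assumptions.

```python
# Hypothetical name; assign this dict to `custom_settings` on the spider class.
BROWSER_LIKE_SETTINGS = {
    # Example browser User-Agent string (an assumption; any current
    # browser's value could be substituted).
    "USER_AGENT": "Mozilla/5.0 (X11; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0",
    # Headers sent with every request unless a request overrides them.
    "DEFAULT_REQUEST_HEADERS": {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-AU,en;q=0.9",
    },
    # Wait a few seconds between requests to keep the load on the site low.
    "DOWNLOAD_DELAY": 5,
    # Keep honouring the site's robots.txt.
    "ROBOTSTXT_OBEY": True,
}
```

Inside the spider this would be used as custom_settings = BROWSER_LIKE_SETTINGS on the class. If the 403 persists with these settings, the block is probably not based on headers alone.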