Background:

I am planning to buy a car and want to monitor prices. I'd like to use Scrapy to do this for me. However, the site blocks my code from doing so.

MWE/Code:

#!/usr/bin/python3

# from bs4 import BeautifulSoup
import scrapy    # adding scrapy to our file

urls = ['https://www.carsales.com.au/cars/volkswagen/golf/7-series/wagon-bodystyle/diesel-fueltype/']

class HeadphoneSpider(scrapy.Spider):   # our class inherits from scrapy.Spider
    name = "headphones"
    def start_requests(self):
        urls = ['https://www.carsales.com.au/cars/volkswagen/golf/7-series/wagon-bodystyle/diesel-fueltype/']# list to enter our urls
        # urls = ['https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=headphones&rh=i%3Aaps%2Ck%3Aheadphones&ajr=2']

        for url in urls:            
            yield scrapy.Request(url=url, callback=self.parse)  # we will explain the callback soon


    def parse(self, response):
        img_urls = response.css('img::attr(src)').extract()
        with open('urls.txt', 'w') as f:
            for u in img_urls:
                f.write(u + "\n")


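if __name__ == "__main__":
    # Run the spider directly when this file is executed as a script.
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess()
    process.crawl(HeadphoneSpider)
    process.start()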

Output:

   ...some stuff above it
   2020-01-10 00:37:59 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.carsales.com.au/cars/volkswagen/golf/7-series/wagon-bodystyle/diesel-fueltype/>: HTTP status code is not handled or not allowed
   ..some more stuff underneath

Question:

I don't know how to get around this "not allowed" response so that I can parse the prices, kilometres, etc. It would make my life so much easier. How can I get past this block? FWIW, I also tried BeautifulSoup, which didn't work either.

3kstc
  • Does this answer your question? [Scraping in Python - Preventing IP ban](https://stackoverflow.com/questions/35133200/scraping-in-python-preventing-ip-ban) – ggorlen Jan 09 '20 at 23:49
  • @ggorlen Should I use Scrapy? It seems like another level up from BeautifulSoup. Help? – 3kstc Jan 10 '20 at 00:08
  • I don't know, but there are many dupes of this question and articles around the web on this exact problem. I think you'll need to show a bit more research before this question is likely to receive useful attention, as a wide variety of general techniques (described in the dupe and other threads) may help you. – ggorlen Jan 10 '20 at 00:17
  • @ggorlen I'm trying to use Scrapy, but I'm getting the `not allowed` card... :/ – 3kstc Jan 10 '20 at 00:48

1 Answer

There are several ways to avoid being blocked by a site while scraping it:

  • Set ROBOTSTXT_OBEY = False
  • Increase DOWNLOAD_DELAY to something like 3 to 4 seconds between requests, depending on the site's behavior
  • Set CONCURRENT_REQUESTS to 1
  • Use a proxy, or a pool of proxies, through a custom proxy middleware
  • Carry the site's cookies in your requests so the site is less likely to identify bot behavior

You can try these solutions one at a time, in order.
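For the last two points, a rough sketch of how a proxy and cookies could be attached to a single request is shown here. `build_request_kwargs` is a hypothetical helper, not part of Scrapy, and the proxy address and cookie values are placeholders. It relies on two things Scrapy does provide: the `proxy` key in a request's `meta` (read by the built-in HttpProxyMiddleware) and the `cookies` argument of `scrapy.Request`. Whether this is enough to get past the 403 depends on the site.

```python
# Hypothetical helper: assembles keyword arguments for scrapy.Request.
# The proxy address and cookie values used below are placeholders.

def build_request_kwargs(url, proxy=None, cookies=None):
    """Return a dict that can be passed as scrapy.Request(**kwargs)."""
    kwargs = {"url": url}
    if proxy:
        # Scrapy's built-in HttpProxyMiddleware reads the "proxy" key in meta.
        kwargs["meta"] = {"proxy": proxy}
    if cookies:
        # scrapy.Request accepts a dict of cookies to send with the request.
        kwargs["cookies"] = dict(cookies)
    return kwargs


kwargs = build_request_kwargs(
    "https://www.carsales.com.au/cars/volkswagen/golf/7-series/wagon-bodystyle/diesel-fueltype/",
    proxy="http://127.0.0.1:8080",          # assumption: a proxy you control
    cookies={"session_id": "placeholder"},  # assumption: copied from a browser session
)

# Inside the spider's start_requests() this would be used as:
#     yield scrapy.Request(callback=self.parse, **kwargs)
```

For a pool of proxies, the same helper could be called with a different `proxy` value per request.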

Ahmed Buksh
  • Where would I write these commands? Are you able to provide a snippet of code, please? I am new to `scrapy` – 3kstc Jan 18 '20 at 00:36
  • You have to write these in the settings file. Please refer to the Scrapy settings docs, which give a clear idea of how to use these settings [https://docs.scrapy.org/en/latest/topics/settings.html] – Ahmed Buksh Jan 19 '20 at 07:17
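
As a minimal sketch of what that could look like, assuming a project created with `scrapy startproject` (which generates a `settings.py`), the first three suggestions from the answer would be plain assignments. The specific values are only the ones suggested above and are not guaranteed to resolve the 403.

```python
# settings.py (inside the Scrapy project package)

ROBOTSTXT_OBEY = False     # do not fetch or obey the site's robots.txt
DOWNLOAD_DELAY = 3         # seconds to wait between requests to the same site
CONCURRENT_REQUESTS = 1    # keep only one request in flight at a time
COOKIES_ENABLED = True     # Scrapy's default; keeps cookies between requests
```

For a single-file spider like the one in the question, the same keys can instead go in a `custom_settings` dict on the spider class.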