How to get around being blocked with "Scrapy"

Question

Background:

I am planning on buying a car and want to monitor the prices. I'd like to use Scrapy to do this for me. However, the site blocks my code from doing this.

MWE/Code:

#!/usr/bin/python3

import scrapy


class HeadphoneSpider(scrapy.Spider):   # our class inherits from scrapy.Spider
    name = "headphones"

    def start_requests(self):
        # list of URLs to crawl
        urls = ['https://www.carsales.com.au/cars/volkswagen/golf/7-series/wagon-bodystyle/diesel-fueltype/']

        for url in urls:
            # parse() is called with the response to each request
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # collect the src attribute of every <img> and write one URL per line
        img_urls = response.css('img::attr(src)').extract()
        with open('urls.txt', 'w') as f:
            for u in img_urls:
                f.write(u + "\n")

Output:

   ...some stuff above it
   2020-01-10 00:37:59 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.carsales.com.au/cars/volkswagen/golf/7-series/wagon-bodystyle/diesel-fueltype/>: HTTP status code is not handled or not allowed
   ..some more stuff underneath

Question:

I just don't know how to get around this "not allowed" response so that I can parse the prices, kilometres, etc. It would make my life so much easier. How can I get past this block? FWIW, I also tried it with BeautifulSoup, which didn't work either.

Does this answer your question? [Scraping in Python - Preventing IP ban](https://stackoverflow.com/questions/35133200/scraping-in-python-preventing-ip-ban) — ggorlen, Jan 09 '20 at 23:49
@ggorlen I should use Scrapy? It seems like another level after BeautifulSoup - help!~? — 3kstc, Jan 10 '20 at 00:08
I don't know, but there are many dupes of this question and articles around the web on this exact problem, so I think you'll need to show a bit more research before this question is likely to receive useful attention as there are a wide variety of general techniques (described in the dupe and other threads) that may help you. — ggorlen, Jan 10 '20 at 00:17
@ggorlen I'm trying to use Scrapy, but I'm getting the `not allowed` card... :/ — 3kstc, Jan 10 '20 at 00:48

score 0 · Answer 1 · edited Jun 07 '21 at 14:39

There are several ways to reduce the chance of being blocked by a site while scraping it:

Set ROBOTSTXT_OBEY = False
Increase DOWNLOAD_DELAY so there are about 3 to 4 seconds between requests, depending on how the site behaves
Set CONCURRENT_REQUESTS to 1
Use a proxy or a pool of proxies, for example through a custom downloader middleware that sets request.meta['proxy']
Carry the site's cookies in your requests so the site is less likely to identify bot behavior

You can try these solutions one at a time, in order.
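
The proxy suggestion in the list above could look roughly like the sketch below. It is only an illustration under stated assumptions: the class name `RotatingProxyMiddleware` and the proxy addresses are hypothetical placeholders, not something from the question. The one piece of Scrapy behaviour it relies on is that a proxy set in `request.meta["proxy"]` is used for that request.

```python
import random


class RotatingProxyMiddleware:
    """Hypothetical downloader middleware that picks a proxy per request."""

    # Placeholder addresses (assumption): replace with proxies you may use.
    PROXIES = [
        "http://proxy1.example.com:8000",
        "http://proxy2.example.com:8000",
    ]

    def process_request(self, request, spider):
        # Scrapy routes the request through whatever is in meta["proxy"].
        request.meta["proxy"] = random.choice(self.PROXIES)
        # Returning None lets Scrapy continue handling the request normally.
        return None
```

To try it, the class would need to be listed in the `DOWNLOADER_MIDDLEWARES` setting, for example as `{"myproject.middlewares.RotatingProxyMiddleware": 350}`, where `myproject.middlewares` is an assumed module path for your own project.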

edited Jun 07 '21 at 14:39

DisappointedByUnaccountableMod

answered Jan 17 '20 at 14:08

Ahmed Buksh

Where would I write these commands? Are you able to provide a snippet of code please? I am new to `scrapy` – 3kstc Jan 18 '20 at 00:36
You have to write these in the settings file. Please refer to the Scrapy settings docs, which give a clear idea of how to use these settings [https://docs.scrapy.org/en/latest/topics/settings.html] – Ahmed Buksh Jan 19 '20 at 07:17
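
As a rough illustration of the settings-file approach described in the comment above, the settings named in the answer could be written like this. It is a sketch, assuming a project created with `scrapy startproject` (which generates a `settings.py`). The value `3` is simply the low end of the range the answer suggests, and `COOKIES_ENABLED` is an assumption about what "carry site cookies" maps to.

```python
# settings.py of the Scrapy project

# Do not fetch and obey the site's robots.txt before crawling.
ROBOTSTXT_OBEY = False

# Base delay in seconds between requests to the same site.
# Scrapy randomizes the actual wait around this value by default.
DOWNLOAD_DELAY = 3

# Send only one request at a time.
CONCURRENT_REQUESTS = 1

# Keep cookie handling on (Scrapy's default) so cookies set by the site
# are sent back on later requests.
COOKIES_ENABLED = True
```

For a single standalone spider file, the same keys can instead go in a `custom_settings` dictionary on the spider class.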
