Questions tagged [scrapy-middleware]

Scrapy middleware is a framework of hooks into Scrapy's request and response processing, where you can plug in custom functionality to process the responses that are sent to spiders and the requests and items that spiders generate.

Scrapy also provides a number of built-in middlewares out of the box for use with your spiders.

23 questions
5 votes · 2 answers

Scrapy FakeUserAgentError: Error occurred during getting browser

I use Scrapy FakeUserAgent and keep getting this error on my Linux server. Traceback (most recent call last): File "/usr/local/lib64/python2.7/site-packages/twisted/internet/defer.py", line 1299, in _inlineCallbacks result = g.send(result) …
Aminah Nuraini · 18,120
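A common workaround, sketched below on the assumption that the error comes from the scrapy-fake-useragent package: configure a fallback user agent so a failed fetch of the browser list degrades gracefully instead of raising FakeUserAgentError.

```python
# settings.py -- a minimal sketch, assuming the scrapy-fake-useragent package.
# FAKEUSERAGENT_FALLBACK makes the middleware fall back to a fixed UA string
# when fake-useragent cannot fetch its browser list.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}
FAKEUSERAGENT_FALLBACK = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
```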
3 votes · 1 answer

Scrapy spider middleware

I have a function (check_duplicates()) in my spider that checks whether a URL is already in my database and, if it is absent, passes the URL on to the parse_product method: def check_duplicates(url): connection = mysql.connector.connect( …
m_sasha · 239
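A minimal sketch of the pattern the question describes, assuming mysql-connector-python; the connection details, table, and selectors are hypothetical placeholders.

```python
import mysql.connector
import scrapy


class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/catalog']  # hypothetical

    def check_duplicates(self, url):
        # Hypothetical schema: a `urls` table with a `url` column.
        connection = mysql.connector.connect(
            host='localhost', user='user', password='secret', database='shop')
        try:
            cursor = connection.cursor()
            cursor.execute('SELECT 1 FROM urls WHERE url = %s LIMIT 1', (url,))
            return cursor.fetchone() is not None
        finally:
            connection.close()

    def parse(self, response):
        for href in response.css('a.product::attr(href)').getall():
            # Only follow URLs that are not already in the database.
            if not self.check_duplicates(response.urljoin(href)):
                yield response.follow(href, callback=self.parse_product)

    def parse_product(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}
```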
2 votes · 1 answer

Scrapy appears to be deduplicating the first request when it is processed with DownloaderMiddleware

I've got a certain spider which inherits from SitemapSpider. As expected, the first request on startup is to sitemap.xml of my website. However, for it to work correctly I need to add a header to all the requests, including the initial ones which…
keddad · 1,398
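The usual cause here is that returning a *new* Request from process_request sends it back through the scheduler, where the duplicate filter drops it. A minimal sketch of the in-place alternative (the header name is hypothetical):

```python
class AddHeaderMiddleware:
    def process_request(self, request, spider):
        # Mutate the request in place and return None; returning a new
        # Request object would re-enter the scheduler and can be dropped
        # by the duplicate filter as a repeat of the original.
        request.headers['X-Custom-Header'] = 'value'  # hypothetical header
        return None
```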
1 vote · 0 answers

How to pause a scrapy spider and make the others keep on scraping?

I am facing a problem with my custom retry middleware in Scrapy. I have a project made of 6 spiders, launched by a small script containing a CrawlerProcess(), crawling 6 different websites. They should work simultaneously, and here is the problem: i…
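For context, a minimal sketch of such a launcher: all spiders share one Twisted reactor, which is why a blocking pause (e.g. time.sleep in a retry middleware) stalls every spider at once. A per-spider pause has to be non-blocking, e.g. via DOWNLOAD_DELAY or deferred retries.

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Runs all spiders concurrently in a single reactor; spider names below
# are hypothetical stand-ins for the six spiders in the question.
process = CrawlerProcess(get_project_settings())
for spider_name in ['spider_a', 'spider_b']:
    process.crawl(spider_name)
process.start()  # blocks until every crawl finishes
```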
1 vote · 1 answer

Scrapy Middleware Selenium with meta

Basically, I have a working version of a middleware that passes all requests through Selenium and returns an HtmlResponse; the problem is that I also want some meta data attached to the request, which I can access in the parse method of the spider. For some…
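A minimal sketch of one way to keep meta reachable, assuming Selenium with Firefox: pass the original request to HtmlResponse, since response.meta is only a proxy for response.request.meta.

```python
from scrapy.http import HtmlResponse
from selenium import webdriver


class SeleniumMiddleware:
    def __init__(self):
        self.driver = webdriver.Firefox()

    def process_request(self, request, spider):
        self.driver.get(request.url)
        return HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,  # keeps request.meta reachable in parse()
        )
```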
1 vote · 2 answers

Trigger errback when process_exception() is called in Middleware

Using Scrapy, I'm implementing a CrawlSpider which will scrape all kinds of websites, and hence sometimes very slow ones, which will eventually produce a timeout. My problem is that if such a twisted.internet.error.TimeoutError occurs, I want to…
nichoio · 6,289
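One hedged approach: let the exception keep propagating. If process_exception returns None, Scrapy eventually fires the errback attached to the request; returning a Response or Request from the middleware would swallow the error instead.

```python
from twisted.internet.error import TimeoutError


class TimeoutPassthroughMiddleware:
    """Let timeouts reach the request's errback instead of handling them here."""

    def process_exception(self, request, exception, spider):
        if isinstance(exception, TimeoutError):
            # Returning None lets the failure keep propagating, so Scrapy
            # calls the errback supplied at request time, e.g.:
            #   Request(url, callback=self.parse, errback=self.on_timeout)
            spider.logger.info('Timeout on %s; deferring to errback', request.url)
            return None
```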
1 vote · 1 answer

Using regex in scrapy downloader middleware

I've been trying to write a custom middleware in Scrapy which flags URLs containing certain patterns using regex. In short, there is a list of exceptions, and each URL is checked against it. However, the middleware does not manage to properly…
T the shirt · 79
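A minimal sketch of such a middleware; the exception patterns are hypothetical. A frequent pitfall is re.match, which only anchors at the start of the URL, whereas re.search matches anywhere in the string.

```python
import re

from scrapy.exceptions import IgnoreRequest


class RegexFilterMiddleware:
    # Hypothetical exception list of URL patterns to drop.
    EXCEPTIONS = [r'/login', r'\.pdf$', r'[?&]sessionid=']

    def __init__(self):
        # Compile once, not on every request.
        self.patterns = [re.compile(p) for p in self.EXCEPTIONS]

    def process_request(self, request, spider):
        # re.search matches anywhere in the URL, unlike re.match.
        if any(p.search(request.url) for p in self.patterns):
            raise IgnoreRequest(f'URL matched exception list: {request.url}')
        return None
```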
1 vote · 0 answers

Scrapy error catching in scrapy/middleware.py file: TypeError: __init__() missing 1 required positional argument: 'uri'

I am getting this error when starting a crawl. I have searched for an answer in several forums and looked at the code in scrapy/middleware.py (it came standard with Scrapy and I have not altered it), and cannot figure out why I am getting the error.…
Steve S · 11
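Without the full traceback this is only a guess, but that TypeError usually means Scrapy instantiated a component whose __init__ requires an argument (here uri) and no from_crawler hook supplies it. A minimal sketch of the pattern, with hypothetical names:

```python
class StorageMiddleware:
    def __init__(self, uri):
        # Scrapy never passes constructor arguments itself; they must come
        # from the classmethod below.
        self.uri = uri

    @classmethod
    def from_crawler(cls, crawler):
        # Hypothetical setting supplying the required `uri` argument.
        return cls(uri=crawler.settings.get('STORAGE_URI'))
```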
1 vote · 0 answers

Scrapy doing retry after yield

I am new to Python and Scrapy, and I am building a simple Scrapy project for scraping posts from a forum. However, sometimes when crawling a post it gets a 200 but redirects to an empty page (maybe because of the unstable server of the forum or…
Joe Leung · 121
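A hedged sketch of one way to do this on Scrapy 2.5+, using get_retry_request so the retry is counted against RETRY_TIMES instead of being dropped by the duplicate filter; the URL and selectors are hypothetical.

```python
import scrapy
from scrapy.downloadermiddlewares.retry import get_retry_request


class ForumSpider(scrapy.Spider):
    name = 'forum'
    start_urls = ['https://example.com/forum']  # hypothetical

    def parse(self, response):
        # Hypothetical "empty page" check: a 200 with no post markup.
        if not response.css('div.post'):
            # Re-yielding response.request directly would be dropped by the
            # duplicate filter; get_retry_request returns a properly counted
            # retry, or None once RETRY_TIMES is exhausted.
            retry = get_retry_request(response.request, spider=self,
                                      reason='empty_page')
            if retry is not None:
                yield retry
            return
        for post in response.css('div.post'):
            yield {'text': post.css('::text').get()}
```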
1 vote · 0 answers

Scrapy - bug in custom DownloaderMiddleware

I have a list of thousands of URLs which I scrape using one spider. Some URLs share the same domain. I want to count the number of timeout errors per domain. If, for domain x, the number of timeouts is higher than LIMIT, I want to avoid scraping all URLs of…
Milano · 18,048
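A minimal sketch of the middleware the question describes; LIMIT is a placeholder value.

```python
from collections import defaultdict
from urllib.parse import urlparse

from scrapy.exceptions import IgnoreRequest
from twisted.internet.error import TimeoutError


class DomainTimeoutMiddleware:
    LIMIT = 5  # hypothetical per-domain timeout budget

    def __init__(self):
        self.timeouts = defaultdict(int)

    def process_request(self, request, spider):
        # Refuse further requests to domains that hit the limit.
        domain = urlparse(request.url).netloc
        if self.timeouts[domain] >= self.LIMIT:
            raise IgnoreRequest(f'Too many timeouts for {domain}')
        return None

    def process_exception(self, request, exception, spider):
        # Count timeouts per domain; return None so normal retry
        # handling still applies.
        if isinstance(exception, TimeoutError):
            self.timeouts[urlparse(request.url).netloc] += 1
        return None
```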
0 votes · 2 answers

Retrying Downloader Middleware For Failed Requests in Scrapy

In Scrapy I'm trying to write a downloader middleware which filters out responses with 401, 403, and 410 statuses and sends new requests for these URLs. The error says that process_response must return a Response or a Request. Because I yield 10 requests to make…
avakado0 · 101
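The constraint behind that error: process_response must return exactly one Response or Request object, never several. A minimal sketch that retries a single flagged URL; fanning out into many new requests belongs in the spider (e.g. an errback), not the middleware.

```python
class RetryStatusMiddleware:
    RETRY_STATUSES = {401, 403, 410}

    def process_response(self, request, response, spider):
        if response.status in self.RETRY_STATUSES:
            # Return one replacement Request; dont_filter=True keeps the
            # duplicate filter from discarding the repeat of the same URL.
            return request.replace(dont_filter=True)
        return response
```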
0 votes · 1 answer

How to build my own middleware in Scrapy?

I'm just starting to learn Scrapy and I have a question. For my spider I have to take a list of URLs (start_urls) from a Google Sheets table, and I have this code: import gspread from oauth2client.service_account import…
m_sasha · 239
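This case may not need a middleware at all: overriding start_requests() lets the spider read its URLs from the sheet directly. A minimal sketch, assuming gspread with oauth2client credentials; the sheet name, key file, and column are placeholders.

```python
import gspread
import scrapy
from oauth2client.service_account import ServiceAccountCredentials


class SheetSpider(scrapy.Spider):
    name = 'sheet'

    def start_requests(self):
        scope = ['https://spreadsheets.google.com/feeds',
                 'https://www.googleapis.com/auth/drive']
        creds = ServiceAccountCredentials.from_json_keyfile_name(
            'credentials.json', scope)  # hypothetical key file
        sheet = gspread.authorize(creds).open('urls').sheet1
        for url in sheet.col_values(1):  # first column holds the URLs
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}
```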
0 votes · 1 answer

How can I read all logs in a middleware?

I have about 100 spiders on a server. Every morning all of the spiders start scraping and write everything to their log files. Sometimes a couple of them give me an error. When a spider gives me an error I have to go to the server and read its log file…
Murat Demir · 716
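One hedged alternative to reading log files: a small extension subscribed to the spider_error signal sees every spider failure in one place. A minimal sketch, with the notification left as a log line:

```python
from scrapy import signals


class ErrorReportExtension:
    # Enable via, e.g. (path hypothetical):
    #   EXTENSIONS = {'myproject.extensions.ErrorReportExtension': 500}

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_error, signal=signals.spider_error)
        return ext

    def spider_error(self, failure, response, spider):
        # Fires for every uncaught exception in any spider callback;
        # a real version might mail or POST this somewhere central.
        spider.logger.critical('Error in %s on %s: %s', spider.name,
                               response.url, failure.getErrorMessage())
```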
0 votes · 0 answers

Too many 429 errors when the cache extension and the proxy middleware are enabled at the same time in scrapy

I am using Scrapy to crawl data. The target website blocks an IP after it sends about 1000 requests. To deal with this I wrote a proxy middleware, and because the amount of data is relatively large, I also wrote a cache extension. When I enabled…
Sherwin · 11
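Without seeing the custom middleware this is only a guess, but a common interaction is the cache storing and replaying 429 responses, so retries keep "hitting" the cached error. A minimal sketch of settings that usually help; the numbers are illustrative, not tuned values.

```python
# settings.py -- keep 429s out of the cache and retry/throttle them instead.
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]
HTTPCACHE_ENABLED = True
HTTPCACHE_IGNORE_HTTP_CODES = [429]  # never cache rate-limit responses
AUTOTHROTTLE_ENABLED = True
DOWNLOAD_DELAY = 1.0
CONCURRENT_REQUESTS_PER_DOMAIN = 4
```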
0 votes · 1 answer

How can I use scrapy middlewares to call a mail function?

I have 15 spiders, and every spider has its own content to send by mail. The spiders also have their own spider_closed method which starts the mail sender, but all of them are the same. At some point the spider count will reach 100, and I don't want to use the same…
Murat Demir · 716
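A minimal sketch of how a single extension could replace the per-spider spider_closed mail code, using Scrapy's built-in MailSender; the recipient address and the per-spider mail_body attribute are hypothetical.

```python
from scrapy import signals
from scrapy.mail import MailSender


class MailOnCloseExtension:
    def __init__(self, mailer):
        self.mailer = mailer

    @classmethod
    def from_crawler(cls, crawler):
        # MailSender picks up MAIL_HOST, MAIL_FROM, etc. from settings.
        ext = cls(MailSender.from_settings(crawler.settings))
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        # Each spider supplies its own content via a hypothetical
        # `mail_body` attribute; the extension itself is shared.
        return self.mailer.send(
            to=['you@example.com'],
            subject=f'{spider.name} closed ({reason})',
            body=getattr(spider, 'mail_body', f'{spider.name} finished.'),
        )
```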