
I would like to keep a scrapy crawler constantly running inside a celery task worker, probably using something like this, or as suggested in the docs. The idea would be to use the crawler to query an external API that returns XML responses. I would like to pass the URL (or the query parameters, letting the crawler build the URL) I want to query to the crawler, and the crawler would make the call and give me back the extracted items. How can I pass a new URL to fetch to the crawler once it has started running? I do not want to restart the crawler every time I want to give it a new URL; instead, I want the crawler to sit idle, waiting for URLs to crawl.

The two methods I've spotted for running scrapy inside another Python process both use a new Process to run the crawler. I would like to avoid forking and tearing down a new process every time I want to crawl a URL, since that is fairly expensive and unnecessary.

Andres

2 Answers


Just have a spider that polls a database (or a file?) and that, when presented with a new URL, creates and yields a new Request() object for it.

You can build it by hand easily enough. There is probably a better way to do it, but that's basically what I did for an open-proxy scraper: the spider gets a list of all the 'potential' proxies from the database and generates a Request() object for each one. When they're returned, they're dispatched down the chain, verified by downstream middleware, and their records are updated by the item pipeline.
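
A rough sketch of that polling pattern (not from the original answer, assuming Python 3, a recent Scrapy release, and a hypothetical sqlite table pending_urls(id, url, fetched)): the spider_idle signal plus DontCloseSpider keep the spider alive between polls. Note that the exact engine.crawl() signature has shifted across Scrapy versions.

import sqlite3

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider

class PollingSpider(scrapy.Spider):
    name = "polling"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Re-poll whenever the spider runs out of requests instead of closing.
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def start_requests(self):
        yield from self.poll()

    def poll(self):
        # Fetch unclaimed URLs and mark them so they are not picked up twice.
        conn = sqlite3.connect("urls.db")
        rows = conn.execute(
            "SELECT id, url FROM pending_urls WHERE fetched = 0"
        ).fetchall()
        for row_id, url in rows:
            conn.execute(
                "UPDATE pending_urls SET fetched = 1 WHERE id = ?", (row_id,)
            )
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)
        conn.commit()
        conn.close()

    def on_idle(self, spider):
        # Schedule any newly queued URLs; raising DontCloseSpider keeps the
        # spider sitting idle instead of letting it shut down.
        for request in self.poll():
            self.crawler.engine.crawl(request, spider)
        raise DontCloseSpider

    def parse(self, response):
        # Hand results down the chain to middleware / item pipelines.
        yield {"url": response.url, "status": response.status}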

synthesizerpatel
  • Yes, I had considered something like that, even using https://github.com/darkrho/scrapy-redis, but I was planning on running the crawler itself as a Celery task, which I think would be easier to manage. I may have to give it some more thought, whether having it poll redis on top of running inside Celery is too much potential for a clusterfluff. The main reason I'd like to keep Celery is because of the many tools to manage workers and create workflows (like the canvas). So any ideas for the original question? – Andres May 23 '13 at 04:22
  • An alternative to polling externally would be to augment scrapyd - note that it has a JSON and (something else) API that you can connect to and start/stop jobs, etc. (see the sketch below). Instead of trying to modify a running spider, maybe you just do some server pooling and launch a new instance of a generic spider? Then you can avoid any third-party arbitration and keep it all under one roof. Somewhere I have a half-finished integration of https://github.com/jrydberg/txgossip into scrapyd - my thinking was to create a peer-to-peer cluster of clown-computers for scraping that you could admin by injecting new 'jobs'. – synthesizerpatel May 23 '13 at 05:29
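
As a rough illustration of the scrapyd route mentioned in the comment above (a hedged sketch, not from the comment itself): scrapyd exposes JSON endpoints such as schedule.json, cancel.json and listjobs.json, so a Celery task or any other process can start and stop crawls over HTTP. The project name "myproject", the spider name "api_spider" and its url argument below are placeholders.

import requests

SCRAPYD = "http://localhost:6800"
PROJECT = "myproject"   # placeholder: the project deployed to scrapyd
SPIDER = "api_spider"   # placeholder: a generic spider that takes a url argument

def schedule(url):
    # schedule.json launches a new run of the spider; extra form fields are
    # passed through to the spider as arguments.
    resp = requests.post(
        SCRAPYD + "/schedule.json",
        data={"project": PROJECT, "spider": SPIDER, "url": url},
    )
    return resp.json()["jobid"]

def cancel(jobid):
    # cancel.json stops a pending or running job.
    return requests.post(
        SCRAPYD + "/cancel.json", data={"project": PROJECT, "job": jobid}
    ).json()

def list_jobs():
    # listjobs.json reports pending, running and finished jobs for a project.
    return requests.get(
        SCRAPYD + "/listjobs.json", params={"project": PROJECT}
    ).json()

jobid = schedule("http://example.com/api?query=foo")
print(jobid, list_jobs())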

You could use a message queue (like IronMQ; full disclosure: I work as a developer evangelist for the company that makes IronMQ) to pass in the URLs.

Then in your crawler, poll for the URLs from the queue, and crawl based on the messages you retrieve.

The example you linked to could be updated like this (it's untested pseudocode, but you should get the basic idea):

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from testspiders.spiders.followall import FollowAllSpider
from iron_mq import IronMQ
import time

mq = IronMQ()
q = mq.queue("scrape_queue")
crawler = Crawler(Settings())
crawler.configure()
while True:  # poll forever
    # Get messages from the queue. The timeout is the number of seconds each
    # message stays reserved so no other crawler picks it up; set it to a safe
    # value (the maximum time it could take you to crawl a page).
    msg = q.get(timeout=120)
    if len(msg["messages"]) < 1:  # no messages waiting to be crawled
        time.sleep(1)             # wait one second
        continue                  # and try again
    spider = FollowAllSpider(domain=msg["messages"][0]["body"]) # crawl the domain in the message
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run() # the script will block here
    q.delete(msg["messages"][0]["id"]) # when you're done with the message, delete it
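
On the producer side (a Celery task, for example), handing the crawler a new URL is just a matter of posting a message to the same queue. A minimal sketch, assuming the iron_mq client and the same queue name; the enqueue_url helper is illustrative, not part of the answer:

from iron_mq import IronMQ

def enqueue_url(url):
    # Credentials / project id come from the iron.json config or environment.
    mq = IronMQ()
    q = mq.queue("scrape_queue")
    q.post(url)  # the message body is the URL the crawler loop will fetch

enqueue_url("http://example.com/api?query=foo")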
Paddy