
Is it possible to make a CrawlSpider after taking a URL as input from a user? Normally, when we create a spider, we explicitly specify a certain URL. Is it possible to take a URL from the user and create a CrawlSpider based on that URL?

Jin-Dominique
  • Probably. I doubt that CrawlSpider cares where you got the url from ... Do you have some example code for how you would create one without user input (i.e. "specifying a certain url")? – mgilson Dec 05 '13 at 02:46
  • 1
    Have a look at http://stackoverflow.com/questions/15611605/how-to-pass-a-user-defined-argument-in-scrapy-spider or http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments and adapt to CrawlSpider – paul trmbrth Dec 05 '13 at 08:46
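
Adapting the spider-arguments approach from those links to a CrawlSpider could look roughly like the sketch below; the spider name, the rule, the parse_item callback, and the example.com URL are illustrative assumptions, not something given in the question:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class UserUrlSpider(CrawlSpider):
    name = 'user_url_spider'

    rules = (
        # follow every link found on the crawled pages and parse it
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def __init__(self, url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # the URL the user typed in becomes the only start URL
        self.start_urls = [url] if url else []

    def parse_item(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}

Inside a Scrapy project you would then run it with scrapy crawl user_url_spider -a url=http://example.com, which is the spider-arguments mechanism those docs describe.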

1 Answer


Yes. You can write a Scrapy spider in a single Python file and then write another Python file that does these two things:

  • receives the URL that the user has input
  • executes the spider with that URL as an argument, the same way you would execute a system command from a Python file

Something like this: you have a Scrapy spider named myspider.py and a control script named controlspider.py (refer to the Scrapy documentation on scrapy runspider to learn how to run a spider from a single file).
In controlspider.py, you could run the spider with the URLs the user has input, passing them as a spider argument (-a):

import subprocess
# pass the user's URLs to the spider as a comma-separated spider argument
subprocess.Popen(['scrapy', 'runspider', 'myspider.py', '-a', 'urls=URL_LIST_USER_INPUT'])

Then, in myspider.py, you can receive that argument in your subclass of CrawlSpider like this:

from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'myspider'

    def __init__(self, urls='', *args, **kwargs):
        super().__init__(*args, **kwargs)
        # ... rules and other initialisation ...
        # store the comma-separated URLs passed in from 'controlspider.py'
        self.start_urls = urls.split(',') if urls else []

This solution is not very elegant. In my opinion, if the logic of your crawl task is not so complex that you have to use Scrapy, you could just write a small spider that receives URLs, crawls the pages, parses them, and stores the results. That is very flexible and not hard to implement.
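
For example, here is a minimal stand-alone sketch using only the standard library; the LinkParser class, the crawl function, and the link-extraction logic are illustrative assumptions:

import sys
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def crawl(url):
    # fetch the page the user asked for and extract its links
    html = urlopen(url).read().decode('utf-8', errors='replace')
    parser = LinkParser()
    parser.feed(html)
    # resolve relative links against the page URL
    return [urljoin(url, link) for link in parser.links]

if __name__ == '__main__':
    user_url = sys.argv[1] if len(sys.argv) > 1 else input('Enter a URL: ')
    for link in crawl(user_url):
        print(link)

Run it with the URL as a command-line argument, or let it prompt for one; storing the results somewhere instead of printing them is then a small change.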

flyer