Is it possible to make a CrawlSpider after taking a URL as input from a user? Normally, when we create a spider, we explicitly specify a certain URL. Is it possible to take a URL from a user and create a CrawlSpider based on that URL?
-
Probably. I doubt that CrawlSpider cares where you got the url from ... Do you have some example code for how you would create one without user input (i.e. "specifying a certain url")? – mgilson Dec 05 '13 at 02:46
-
Have a look at http://stackoverflow.com/questions/15611605/how-to-pass-a-user-defined-argument-in-scrapy-spider or http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments and adapt to CrawlSpider – paul trmbrth Dec 05 '13 at 08:46
1 Answer
Yes. You can write a Scrapy spider in a single Python file and then write another Python file that does two things:
- receives the URL(s) that the user inputs
- runs the spider, passing those URLs along as a spider argument, just as you would run any system command from a Python script
Something like this:
You have a Scrapy spider named myspider.py and a control script named controlspider.py (refer to this doc to learn how to run a Scrapy spider in a single file).
In controlspider.py, you can run the spider with the URLs the user entered like this:
import subprocess

# 'URL_LIST_USER_INPUT' is a placeholder for the comma-separated urls the user typed in;
# they are handed to the spider as a spider argument (-a name=value)
subprocess.Popen(['scrapy', 'runspider', 'myspider.py', '-a', 'urls=URL_LIST_USER_INPUT'])
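For instance, a complete controlspider.py could be as small as the following sketch (the input() prompt and the urls argument name are assumptions, chosen to match the example above):

# controlspider.py -- sketch: read urls from the user, then launch the spider
import subprocess

# receive the url(s) the user inputs, e.g. "http://example.com,http://example.org"
user_urls = input('Enter one or more urls, separated by commas: ').strip()

# execute the spider just like a system command, handing the urls over with -a
subprocess.run(['scrapy', 'runspider', 'myspider.py', '-a', 'urls=' + user_urls])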
Then in myspider.py, you can receive that argument in your CrawlSpider subclass like this:
from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'myspider'

    def __init__(self, urls=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # ... your other initialisation here ...
        # store the urls passed from 'controlspider.py'; CrawlSpider starts crawling from start_urls
        self.start_urls = urls.split(',') if urls else []
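Putting the spider side together, a runnable myspider.py might look roughly like the sketch below (the rule, callback and parsing logic are placeholders to adapt to your own crawl):

# myspider.py -- sketch of a CrawlSpider that starts from user-supplied urls
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'myspider'

    # follow every link found on the start pages and send each page to parse_item
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def __init__(self, urls=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # 'urls' arrives as the comma-separated string passed with -a urls=...
        self.start_urls = urls.split(',') if urls else []

    def parse_item(self, response):
        # placeholder parsing: record the url and <title> of every crawled page
        yield {'url': response.url, 'title': response.css('title::text').get()}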
This solution is not very elegant. In my opinion, if the logic of your crawl task is not complex enough to justify Scrapy, you could just write a small spider that receives URLs, crawls the pages, parses them and stores the results. That is flexible and not hard to implement, as sketched below.
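As a rough illustration of that last suggestion, a standard-library-only version could look like this sketch (here "parse the page" just means pulling out the <title>, and "store the results" means collecting them in a list; replace both with whatever your task needs):

# smallspider.py -- sketch: receive urls, fetch each page, parse out the title, store the results
from urllib.request import urlopen
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the <title> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def crawl(urls):
    results = []
    for url in urls:
        html = urlopen(url).read().decode('utf-8', errors='replace')
        parser = TitleParser()
        parser.feed(html)
        results.append({'url': url, 'title': parser.title.strip()})
    return results

if __name__ == '__main__':
    user_urls = [u.strip() for u in input('Enter urls separated by commas: ').split(',')]
    print(crawl(user_urls))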
