
Is it possible to make a CrawlSpider after taking a URL as input from a user? Normally, when we create a spider, we explicitly specify a certain URL. Is it possible to take a URL from the user and create a CrawlSpider based on that URL?

Jin-Dominique
  • Probably. I doubt that CrawlSpider cares where you got the url from ... Do you have some example code for how you would create one without user input (i.e. "specifying a certain url")? – mgilson Dec 05 '13 at 02:46
  • 1
    Have a look at http://stackoverflow.com/questions/15611605/how-to-pass-a-user-defined-argument-in-scrapy-spider or http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments and adapt to CrawlSpider – paul trmbrth Dec 05 '13 at 08:46
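
Adapting the spider-arguments approach from those links to a CrawlSpider could look roughly like the sketch below; the spider name, the rule, the parse_item callback, and the example.com URL are illustrative assumptions, not something given in the question:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class UserUrlSpider(CrawlSpider):
    name = 'user_url_spider'

    rules = (
        # follow every link found on the crawled pages and parse it
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def __init__(self, url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # the URL the user typed in becomes the only start URL
        self.start_urls = [url] if url else []

    def parse_item(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}

Inside a Scrapy project you would then run it with scrapy crawl user_url_spider -a url=http://example.com, which is the spider-arguments mechanism those docs describe.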

1 Answer


Yes. You can write a Scrapy spider in a single Python file and then write another Python file that does these two things:

  • receives the URL that the user has input
  • executes the spider with that URL as an argument, the same way you would execute a system command from a Python file

Something like this: you have a Scrapy spider named myspider.py and a control script named controlspider.py (refer to the Scrapy documentation on scrapy runspider to learn how to run a spider from a single file).
In controlspider.py, you could run the spider with the URLs the user has input, passing them as a spider argument (-a):

import subprocess
# pass the user's URLs to the spider as a comma-separated spider argument
subprocess.Popen(['scrapy', 'runspider', 'myspider.py', '-a', 'urls=URL_LIST_USER_INPUT'])

Then, in myspider.py, you can receive that argument in your subclass of CrawlSpider like this:

from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'myspider'

    def __init__(self, urls='', *args, **kwargs):
        super().__init__(*args, **kwargs)
        # ... rules and other initialisation ...
        # store the comma-separated URLs passed in from 'controlspider.py'
        self.start_urls = urls.split(',') if urls else []

This solution is not very elegant. In my opinion, if the logic of your crawl task is not so complex that you have to use Scrapy, you could just write a small spider that receives URLs, crawls the pages, parses them, and stores the results. That is very flexible and not hard to implement.
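
For example, here is a minimal stand-alone sketch using only the standard library; the LinkParser class, the crawl function, and the link-extraction logic are illustrative assumptions:

import sys
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def crawl(url):
    # fetch the page the user asked for and extract its links
    html = urlopen(url).read().decode('utf-8', errors='replace')
    parser = LinkParser()
    parser.feed(html)
    # resolve relative links against the page URL
    return [urljoin(url, link) for link in parser.links]

if __name__ == '__main__':
    user_url = sys.argv[1] if len(sys.argv) > 1 else input('Enter a URL: ')
    for link in crawl(user_url):
        print(link)

Run it with the URL as a command-line argument, or let it prompt for one; storing the results somewhere instead of printing them is then a small change.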

flyer