
Is it possible with Scrapy to do something like this: when I call scrape.crawl("website") from a class, it hands control over to the class where the scraping code lives and executes the function?

I searched various sources, and most of them told me to write it in script form, but I couldn't find any working example that shows how to initialise the object so that I can invoke it from a script.

I came close with this code, but it's not working.

from scrapy import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # DmozItem and DmozItemLoader come from the project's items module
        for sel in response.xpath('//ul/li'):
            loader = DmozItemLoader(DmozItem(), selector=sel, response=response)
            loader.add_xpath('title', 'a/text()')
            loader.add_xpath('link', 'a/@href')
            loader.add_xpath('desc', 'text()')
            yield loader.load_item()

Calling the object?

spider = DmozSpider()

Any kind souls with a working example of what I want?

Qing Yong
  • See http://stackoverflow.com/questions/18838494/scrapy-very-basic-example/27744766#27744766. – alecxe Jul 09 '15 at 14:36
  • @alecxe Hi. I tried your example but got this error: update_setting not found. The code that is causing the error is this line: settings = Settings() – Qing Yong Jul 10 '15 at 01:26

1 Answer


For this you need quite a complex structure -- if I understand your question right.

If you have your spider instance, you need to set up a Crawler and then start it. For example:

from twisted.internet import reactor
from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings

crawler = Crawler(get_project_settings())
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
reactor.run()  # blocks until the spider_closed signal stops the reactor

This is only the skeleton, but it should be enough to get you started. However, as I said, it is quite complex, and you will need some configuration beyond this to get it running.
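Note that this is the pre-1.0 Crawler API, which is probably why the comment above reports an error around Settings(). On Scrapy 1.0 or newer, a minimal sketch would use CrawlerProcess instead, which manages the reactor for you:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(DmozSpider)  # pass the spider class; keyword arguments reach __init__
process.start()            # blocks until the crawl is finished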

Update

If you have a URL and want Scrapy to crawl that site, you could do it like this:

def __init__(self, url, *args, **kwargs):
    super(DmozSpider, self).__init__(*args, **kwargs)
    # replace the hard-coded start_urls with the URL passed in
    self.start_urls = [url]

Then start crawling as described above. Because Scrapy spiders start crawling as soon as you launch them, you need the right sequence: set the start URL first, then start the crawl.
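For example, a call site might look like this (reusing the crawler setup from the first snippet; the URL is just a placeholder):

spider = DmozSpider(url="http://www.dmoz.org/Computers/")
crawler.crawl(spider)  # queue the spider instance on the configured crawler
crawler.start()
reactor.run()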

GHajba