
Is it possible with Scrapy to do something like this: when I call scrape.crawl("website") from a class, it hands control over to the class where the scraping code lives and executes the function?

I searched various sources, and most of them told me to write it in script form, but I couldn't find any working example that shows how to initialise the object so that I can invoke it from a script.

I came close with this code, but it's not working.

from scrapy import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # DmozItem and DmozItemLoader come from the project's items module
        for sel in response.xpath('//ul/li'):
            loader = DmozItemLoader(DmozItem(), selector=sel, response=response)
            loader.add_xpath('title', 'a/text()')
            loader.add_xpath('link', 'a/@href')
            loader.add_xpath('desc', 'text()')
            yield loader.load_item()

Calling the object?

spider = DmozSpider()

Any kind souls with a working example of what I want?

Qing Yong
  • See http://stackoverflow.com/questions/18838494/scrapy-very-basic-example/27744766#27744766. – alecxe Jul 09 '15 at 14:36
  • @alecxe Hi. I tried your example but got this error: update_setting not found. The code that is causing the error is this line: settings = Settings() – Qing Yong Jul 10 '15 at 01:26

1 Answer


For this you need quite a complex structure -- if I understand your question right.

If you have your spider instance, you need to set up a Crawler and then start it. For example:

from twisted.internet import reactor
from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings

crawler = Crawler(get_project_settings())
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
reactor.run()  # blocks until the spider_closed signal stops the reactor

This is only the skeleton, but it should be enough to get you started. However, as I said, it is quite complex, and you will need some configuration beyond this to get it running.
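Note that this is the pre-1.0 Crawler API, which is probably why the comment above reports an error around Settings(). On Scrapy 1.0 or newer, a minimal sketch would use CrawlerProcess instead, which manages the reactor for you:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(DmozSpider)  # pass the spider class; keyword arguments reach __init__
process.start()            # blocks until the crawl is finished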

Update

If you have a URL and want Scrapy to crawl that site, you could do it like this:

def __init__(self, url, *args, **kwargs):
    super(DmozSpider, self).__init__(*args, **kwargs)
    # replace the hard-coded start_urls with the URL passed in
    self.start_urls = [url]

Then start crawling as described above. Because Scrapy spiders start crawling as soon as you launch them, you need the right sequence: set the start URL first, then start the crawl.
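For example, a call site might look like this (reusing the crawler setup from the first snippet; the URL is just a placeholder):

spider = DmozSpider(url="http://www.dmoz.org/Computers/")
crawler.crawl(spider)  # queue the spider instance on the configured crawler
crawler.start()
reactor.run()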

GHajba