
I'm going through the source of a library. The question I have is general and not specific to the library. The code I'm wondering about looks like this:

class SpiderMiddlewareManager(MiddlewareManager):
    # ...
    def process_start_requests(self, start_requests, spider):
        return self._process_chain('process_start_requests', start_requests, spider)

class
    # ...
    def open_spider(self, spider, start_requests=(), close_if_idle=True):
        # ...
        start_requests = yield self.scraper.spidermw.process_start_requests(start_requests, spider)
        # ...

My question is: how is `start_requests = yield self.scraper.spidermw.process_start_requests(...)` different from `start_requests = self.scraper.spidermw.process_start_requests(...)`, given that `self.scraper.spidermw.process_start_requests` already returns a value? If my understanding is correct, `open_spider` isn't a generator.

Thanks

Kar
  • I don't see how this is a dup - `open_spider` shouldn't be a generator. So I don't see how my question is a duplicate. – Kar Mar 22 '15 at 13:50
  • `open_spider` is a generator. – interjay Mar 22 '15 at 13:59
  • But what value is `start_requests` supposed to take? – Kar Mar 22 '15 at 14:00
  • @Kate whatever is sent - see e.g. http://stackoverflow.com/q/2022218/3001761 – jonrsharpe Mar 22 '15 at 14:01
  • It looks like the method using the `yield` call is decorated with a `@deferred.inlineCallbacks`, which means it's a twisted coroutine. The `yield` allows it to return control to the twisted reactor while blocking I/O runs. – dano Mar 22 '15 at 14:03

1 Answer

Your question is missing an important detail: the library you're looking at is built on Twisted, an asynchronous networking framework. The complete method declaration actually looks like this:

    @defer.inlineCallbacks
    def open_spider(self, spider, start_requests=(), close_if_idle=True):
        assert self.has_capacity(), "No free spider slot when opening %r" % \
            spider.name
        log.msg("Spider opened", spider=spider)
        nextcall = CallLaterOnce(self._next_request, spider)
        scheduler = self.scheduler_cls.from_crawler(self.crawler)
        start_requests = yield self.scraper.spidermw.process_start_requests(start_requests, spider)

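The defer.inlineCallbacks decorator changes what yield means inside the function. The decorated function is still a generator, but Twisted drives it: each time it yields a Deferred, the generator is suspended until that Deferred fires, and the Deferred's result then becomes the value of the yield expression. Essentially it lets you write asynchronous code that would normally use callbacks in a way that looks synchronous. From the Twisted documentation: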

inlineCallbacks helps you write Deferred-using code that looks like a regular sequential function. This function uses features of Python 2.5 generators. If you need to be compatible with Python 2.4 or before, use the deferredGenerator function instead, which accomplishes the same thing, but with somewhat more boilerplate. For example:

    @inlineCallbacks
    def thingummy():
        thing = yield makeSomeRequestResultingInDeferred()
        print(thing)  # the result! hoorj!

When you call anything that results in a Deferred, you can simply yield it; your generator will automatically be resumed when the Deferred's result is available. The generator will be sent the result of the Deferred with the send method on generators, or if the result was a failure, throw.

Your inlineCallbacks-enabled generator will return a Deferred object, which will result in the return value of the generator (or will fail with a failure object if your generator raises an unhandled exception). Note that you can't use return result to return a value; use returnValue(result) instead. Falling off the end of the generator, or simply using return will cause the Deferred to have a result of None.
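The send mechanism mentioned in that excerpt can be tried without Twisted at all. Here is a minimal sketch with a plain generator; `open_spider_like` and the strings in it are made up for illustration and are not part of Scrapy or Twisted. It only shows that the value of a `yield` expression is whatever the caller sends back in, not the object that was yielded out:

```python
def open_spider_like():
    # Hypothetical generator, loosely shaped like open_spider.
    # The object on the right of `yield` goes OUT to the caller;
    # the value assigned on the left is whatever the caller sends back IN.
    start_requests = yield "a pending result"
    yield "got %r" % (start_requests,)


gen = open_spider_like()
yielded = next(gen)                 # run up to the first yield
reply = gen.send(["req1", "req2"])  # resume; the yield expression evaluates to this list

print(yielded)
print(reply)
```

With inlineCallbacks, Twisted plays the role of the caller here: it receives the yielded Deferred and later calls send with that Deferred's result.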

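If you dig into the process_start_requests call, you'll find it ultimately calls scrapy.utils.defer.process_chain, which returns a Deferred rather than the processed requests themselves. That is the difference you asked about: without yield, start_requests would be bound to the Deferred object; with yield, inlineCallbacks waits for the Deferred to fire and sends its result back into the generator, so start_requests is bound to that result. Here is process_chain: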

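    from twisted.internet import defer

    def process_chain(callbacks, input, *a, **kw):
        """Return a Deferred built by chaining the given callbacks"""
        d = defer.Deferred()
        for x in callbacks:
            # each callback receives the previous result plus *a and **kw
            d.addCallback(x, *a, **kw)
        # fire the chain with the initial input
        d.callback(input)
        return d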
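To see the with/without-yield difference end to end, here is a rough, self-contained sketch that runs without Twisted installed. Everything in it is a simplified stand-in and an assumption on my part: `TinyDeferred` and `tiny_inline_callbacks` are hypothetical names that only mimic the small part of Deferred and inlineCallbacks behaviour discussed here (no errbacks, no failures), and `add_meta` is an invented middleware callback.

```python
class TinyDeferred:
    """Hypothetical stand-in for a Deferred: callbacks only, no error handling."""

    def __init__(self):
        self.callbacks = []
        self.called = False
        self.result = None

    def addCallback(self, fn, *a, **kw):
        if self.called:
            # already fired: run the callback immediately on the current result
            self.result = fn(self.result, *a, **kw)
        else:
            self.callbacks.append((fn, a, kw))
        return self

    def callback(self, result):
        self.called = True
        self.result = result
        for fn, a, kw in self.callbacks:
            self.result = fn(self.result, *a, **kw)
        self.callbacks = []


def process_chain(callbacks, input, *a, **kw):
    # same shape as the function above, but built on the stand-in class
    d = TinyDeferred()
    for x in callbacks:
        d.addCallback(x, *a, **kw)
    d.callback(input)
    return d


def tiny_inline_callbacks(gen_func):
    """Hypothetical, much simplified driver in the spirit of inlineCallbacks."""
    def wrapper(*args, **kwargs):
        gen = gen_func(*args, **kwargs)
        out = TinyDeferred()

        def step(value):
            try:
                yielded = gen.send(value)   # resume; `value` becomes the yield's result
            except StopIteration:
                out.callback(None)          # generator finished
            else:
                yielded.addCallback(step)   # resume again once this one has fired
            return value                    # leave the yielded deferred's result unchanged

        step(None)                          # sending None starts the generator
        return out
    return wrapper


def add_meta(requests, spider):
    # invented middleware step: tag every request with the spider name
    return [(r, spider) for r in requests]


seen = {}


@tiny_inline_callbacks
def open_spider_like(start_requests):
    seen["without"] = process_chain([add_meta], start_requests, "spider")
    seen["with"] = yield process_chain([add_meta], start_requests, "spider")


open_spider_like(["req1", "req2"])
print(type(seen["without"]).__name__)
print(seen["with"])
```

Without yield the name is bound to the deferred-like object itself; with yield it is bound to the processed list. The real inlineCallbacks also handles failures (via throw), return values, and Deferreds that have not fired yet, none of which this sketch attempts.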
dano