
I have an item object and I need to pass it along many pages to store data in a single item.

My item looks like this:

class DmozItem(Item):
    title = Field()
    description1 = Field()
    description2 = Field()
    description3 = Field()

Those three descriptions are on three separate pages, so I want to do something like the following.

This works fine for parseDescription1:

def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []
    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
    request.meta['item'] = item
    return request 

def parseDescription1(self,response):
    item = response.meta['item']
    item['desc1'] = "test"
    return item

But I want something like this:

def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []
    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
    request.meta['item'] = item

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription2)
    request.meta['item'] = item

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription3)
    request.meta['item'] = item

    return request 

def parseDescription1(self,response):
    item = response.meta['item']
    item['desc1'] = "test"
    return item

def parseDescription2(self,response):
    item = response.meta['item']
    item['desc2'] = "test2"
    return item

def parseDescription3(self,response):
    item = response.meta['item']
    item['desc3'] = "test3"
    return item
– user1858027

4 Answers


No problem. Following is a corrected version of your code:

def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
    request.meta['item'] = item
    yield request

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription2, meta={'item': item})
    yield request

    yield Request("http://www.example.com/lin1.cpp", callback=self.parseDescription3, meta={'item': item})

def parseDescription1(self, response):
    item = response.meta['item']
    item['desc1'] = "test"
    return item

def parseDescription2(self, response):
    item = response.meta['item']
    item['desc2'] = "test2"
    return item

def parseDescription3(self, response):
    item = response.meta['item']
    item['desc3'] = "test3"
    return item
– cse (edited by warvariuc)
  • But how can I make sure that the item received by `parseDescription2` already has `description1` set in it? – user1858027 Dec 17 '12 at 10:46
  • For that you will have to yield the next request once the previous one has been processed. You have no choice, as you cannot control loading times. See [here](http://stackoverflow.com/questions/6566322) – warvariuc Dec 17 '12 at 10:59
  • Thanks warvariuc, can you please have a look at this question? I am still not able to solve it: http://stackoverflow.com/questions/13900877/scrapy-not-working-with-return-and-yield-together – user1858027 Dec 17 '12 at 12:31
  • Note that this approach returns a total of three items (each containing only one 'descX' key). If you want to gather desc(1,2,3) into ONE item, you'll have to use the method by Dave McLain or the one I proposed. – oliverguenther Aug 29 '14 at 15:05
  • I'm confused... `page_parser` will raise a NameError on seeing `item`. What is that supposed to be? – Nick T Feb 03 '16 at 22:44
  • @NickT, yes, the code does not work as-is -- it wasn't tested, just refactored from what the author had posted (his code has the same problem). The answer is about the approach, not the exact implementation. I think `item` is just a filled instance of the `Item` class. – warvariuc Feb 04 '16 at 07:37
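
As the last two comments point out, `item` is never defined in `page_parser` above. For readers who want to run this, here is a minimal sketch of how the item could be built from the selected rows before the requests are yielded; `DmozItem` comes from the question, but the selector, the field choice, and the use of the newer `response.xpath(...).get()` API are assumptions:

def page_parser(self, response):
    # One item per row; each description page receives the same item via meta
    for row in response.xpath('//div[@class="row"]'):
        item = DmozItem()
        item['title'] = row.xpath('.//h3/text()').get()  # hypothetical field/selector
        yield Request("http://www.example.com/lin1.cpp",
                      callback=self.parseDescription1,
                      meta={'item': item})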

In order to guarantee an ordering of the requests/callbacks, and to ensure that only one item is ultimately returned, you need to chain your requests using a form like this:

def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
    request.meta['item'] = Item()
    return [request]

def parseDescription1(self, response):
    item = response.meta['item']
    item['desc1'] = "test"
    return [Request("http://www.example.com/lin2.cpp", callback=self.parseDescription2, meta={'item': item})]

def parseDescription2(self, response):
    item = response.meta['item']
    item['desc2'] = "test2"
    return [Request("http://www.example.com/lin3.cpp", callback=self.parseDescription3, meta={'item': item})]

def parseDescription3(self, response):
    item = response.meta['item']
    item['desc3'] = "test3"
    return [item]

Each callback function returns an iterable of items or requests; requests are scheduled and items are run through your item pipeline.

If you return an item from each of the callbacks, you'll end up with 4 items in various states of completeness in your pipeline, but if you return the next request instead, then you can guarantee the order of requests and that you will have exactly one item at the end of execution.

– Dave McLain
  • If you want to return a single item, this is the way to go. However, it exposes a problem: depending on your use case, some of the parseDescription(1,2,3) methods might fail. If they do, the item is lost. See my answer for my proposed solution to this problem. – oliverguenther Aug 29 '14 at 15:07
  • Best explanation I've seen so far. But is it a mistake that on line 3 you declared `items` (plural) while everywhere else it is `item`? – willdanceforfun Feb 26 '16 at 13:48
  • I'm happy that I found this answer. But the solution is a bit ugly, because returning the item is now tucked away in an arbitrary parseDescription method. I would rather see the page_parser method returning the item. I expected I could capture the value after yielding, like `item = yield scrapy.Request(url=url, callback=self.callback)`, but this is always `None` it seems. Unfortunate, because it would make things so much clearer. – Flip Jun 23 '18 at 09:51

The accepted answer returns a total of three items [with desc(i) set for i=1,2,3].

If you want to return a single item, Dave McLain's answer does work; however, it requires parseDescription1, parseDescription2, and parseDescription3 to succeed and run without errors in order to return the item.

For my use case, some of the subrequests MAY return HTTP 403/404 errors at random, and thus I lost some of the items, even though I could have scraped them partially.


Workaround

Thus, I currently employ the following workaround: instead of only passing the item around in the request.meta dict, pass around a call stack that knows which request to call next. It calls the next target on the stack (as long as the stack isn't empty) and yields the item once the stack is empty.

The errback request parameter is used to return to the dispatcher method upon errors and simply continue with the next stack item.

def callnext(self, response):
    ''' Call the next target for the item loader, or yield the loaded item if the stack is exhausted. '''

    # Get the meta object from the request, as the response
    # does not contain it.
    meta = response.request.meta

    # Items remaining in the stack? Execute them
    if len(meta['callstack']) > 0:
        target = meta['callstack'].pop(0)
        yield Request(target['url'], meta=meta, callback=target['callback'], errback=self.callnext)
    else:
        yield meta['loader'].load_item()

def parseDescription1(self, response):

    # Recover item(loader)
    l = response.meta['loader']

    # Use just as before
    l.add_css(...)

    # Build the call stack
    callstack = [
        {'url': "http://www.example.com/lin2.cpp",
         'callback': self.parseDescription2},
        {'url': "http://www.example.com/lin3.cpp",
         'callback': self.parseDescription3}
    ]

    # Make the stack available to callnext() through the request meta
    response.meta['callstack'] = callstack

    return self.callnext(response)

def parseDescription2(self, response):

    # Recover item(loader)
    l = response.meta['loader']

    # Use just as before
    l.add_css(...)

    return self.callnext(response)


def parseDescription3(self, response):

    # ...

    return self.callnext(response)

Warning

This solution is still synchronous, and will still fail if you have any exceptions within the callbacks.

For more information, check the blog post I wrote about that solution.

– oliverguenther
  • How does the first URL request occur? I am aware of start_urls and the possibility of overriding the start_requests() method, but I don't see anything similar here. – secuaz Oct 13 '16 at 14:09
  • Can you please explain where you define the item loader object and how you pass it on with the request for the first time in the code? – secuaz Oct 17 '16 at 11:59
  • @secuaz Just append it to your original `response.request.meta` – frenzy May 27 '18 at 15:30
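
Picking up the questions in the comments above: the original answer never shows the first request or where the loader comes from. Here is a minimal, hypothetical sketch of how the spider could seed the loader and an initially empty call stack; the URL, the ItemLoader setup, and using start_requests as the entry point are my assumptions, not part of the original answer:

# from scrapy import Request
# from scrapy.loader import ItemLoader  (scrapy.contrib.loader in old Scrapy versions)

def start_requests(self):
    # Create the loader once and pass it along in meta; parseDescription1
    # then fills meta['callstack'] with the remaining pages to visit.
    # Depending on the Scrapy version, each callback may need to give the
    # loader a selector for its own response before calling add_css().
    loader = ItemLoader(item=DmozItem())
    yield Request("http://www.example.com/lin1.cpp",
                  callback=self.parseDescription1,
                  errback=self.callnext,
                  meta={'loader': loader, 'callstack': []})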

All of the answers provided do have their pros and cons. I'm just adding an extra one to demonstrate how this has been simplified due to changes in the codebase (both Python & Scrapy). We no longer need to use meta and can instead use cb_kwargs (i.e. keyword arguments to pass to the callback function).

So instead of doing this:

def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []

    request = Request("http://www.example.com/lin1.cpp",
                      callback=self.parseDescription1)
    request.meta['item'] = Item()
    return [request]


def parseDescription1(self,response):
    item = response.meta['item']
    item['desc1'] = "test"
    return [Request("http://www.example.com/lin2.cpp",
                    callback=self.parseDescription2, meta={'item': item})]
...

We can do this:

def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []

    yield response.follow("http://www.example.com/lin1.cpp",
                          callback=self.parseDescription1,
                          cb_kwargs={"item": Item()})


def parseDescription1(self, response, item):
    item['desc1'] = "More data from this new response"
    yield response.follow("http://www.example.com/lin2.cpp",
                          callback=self.parseDescription2,
                          cb_kwargs={'item': item})
...

And if for some reason you have multiple links that you want to process with the same function, we can swap

yield response.follow(a_single_url,
                      callback=some_function,
                      cb_kwargs={"data": to_pass_to_callback})

with

yield from response.follow_all([many, urls, to, parse],
                               callback=some_function,
                               cb_kwargs={"data": to_pass_to_callback})
– JakeCowton
  • What is the difference, and what are the pros and cons, between meta and cb_kwargs? I could not find a clear answer. – Pyd Oct 23 '20 at 15:57
  • @Pyd passing them as `cb_kwargs` means they are passed as arguments to the callback function instead of needing to be extracted from `response.meta`. You can see that `parseDescription1` takes `item` as an argument when using `cb_kwargs`. – JakeCowton Oct 23 '20 at 16:02