37
from scrapy import Request, Selector  # imports these spider methods rely on

def parse(self, response):
    for sel in response.xpath('//tbody/tr'):
        item = HeroItem()
        item['hclass'] = response.request.url.split("/")[8].split('-')[-1]
        item['server'] = response.request.url.split('/')[2].split('.')[0]
        item['hardcore'] = len(response.request.url.split("/")[8].split('-')) == 3
        item['seasonal'] = response.request.url.split("/")[6] == 'season'
        item['rank'] = sel.xpath('td[@class="cell-Rank"]/text()').extract()[0].strip()
        item['battle_tag'] = sel.xpath('td[@class="cell-BattleTag"]//a/text()').extract()[1].strip()
        item['grift'] = sel.xpath('td[@class="cell-RiftLevel"]/text()').extract()[0].strip()
        item['time'] = sel.xpath('td[@class="cell-RiftTime"]/text()').extract()[0].strip()
        item['date'] = sel.xpath('td[@class="cell-RiftTime"]/text()').extract()[0].strip()  # note: same cell/XPath as 'time' above
        url = 'https://' + item['server'] + '.battle.net/' + sel.xpath('td[@class="cell-BattleTag"]//a/@href').extract()[0].strip()

        yield Request(url, callback=self.parse_profile)

def parse_profile(self, response):
    sel = Selector(response)
    item = HeroItem()
    item['weapon'] = sel.xpath('//li[@class="slot-mainHand"]/a[@class="slot-link"]/@href').extract()[0].split('/')[4]
    return item

Well, I'm scraping a whole table in the main parse method, and I take several fields from that table. One of those fields is a URL, which I want to follow to scrape a whole new bunch of fields. How can I pass my already-created item object to the callback function so that the final item keeps all the fields?

As shown in the code above, I can save the fields behind the URL (the code as it stands) or only the ones from the table (by simply writing `yield item`), but I can't yield a single object with all the fields together.

I have tried this, but obviously it doesn't work.

yield Request(url, callback=self.parse_profile(item))  # calls parse_profile immediately and passes its return value, instead of passing a callable

def parse_profile(self, response, item):
    sel = Selector(response)
    item['weapon'] = sel.xpath('//li[@class="slot-mainHand"]/a[@class="slot-link"]/@href').extract()[0].split('/')[4]
    return item
Navid777
vic
  • Try having a look at decorators, e.g. http://thecodeship.com/patterns/guide-to-python-function-decorators/ – Overclover Aug 27 '15 at 14:40
  • So the url returns fields which are not present in `item` and you want to add these fields to `item` and return it? – Michael S Priz Aug 27 '15 at 14:43
  • For the Python-general method refer to [callback - Python, how to pass an argument to a function pointer parameter? - Stack Overflow](https://stackoverflow.com/questions/13783211/python-how-to-pass-an-argument-to-a-function-pointer-parameter) -- but in this case there's a scrapy-specific (possibly better) method. – user202729 Aug 15 '21 at 06:16

4 Answers

61

This is what you'd use the `meta` keyword for.

def parse(self, response):
    for sel in response.xpath('//tbody/tr'):
        item = HeroItem()
        # Item assignment here
        url = 'https://' + item['server'] + '.battle.net/' + sel.xpath('td[@class="cell-BattleTag"]//a/@href').extract()[0].strip()

        yield Request(url, callback=self.parse_profile, meta={'hero_item': item})

def parse_profile(self, response):
    item = response.meta.get('hero_item')  # retrieve the item stashed on the request
    item['weapon'] = response.xpath('//li[@class="slot-mainHand"]/a[@class="slot-link"]/@href').extract()[0].split('/')[4]
    yield item

Also note, doing `sel = Selector(response)` is a waste of resources and differs from what you did earlier, so I changed it. It's automatically mapped in the response as `response.selector`, which also has the convenience shortcut of `response.xpath`.
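For example (a minimal sketch with a hypothetical `//h1/text()` query), all three lines below run the same XPath, but only the first two reuse the selector Scrapy has already built for the response:

titles = response.xpath('//h1/text()').extract()
titles = response.selector.xpath('//h1/text()').extract()
titles = Selector(response).xpath('//h1/text()').extract()  # builds a fresh Selector needlessly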

Rejected
    This should be the accepted answer. – Daniel Gerber Dec 28 '21 at 12:41
  • How to pass variables to the error callback? – West Mar 03 '22 at 11:25
  • @DanielGerber - `meta` shouldn't be used for this based on the latest doc https://docs.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments > Request.cb_kwargs was introduced in version 1.7. Prior to that, using Request.meta was recommended for passing information around callbacks. After 1.7, Request.cb_kwargs became the preferred way for handling user information, leaving Request.meta for communication with components like middlewares and extensions. – Noel Bautista Mar 31 '23 at 22:39
16

Here's a better way to pass args to the callback function:

def parse(self, response):
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             cb_kwargs=dict(main_url=response.url))
    request.cb_kwargs['foo'] = 'bar'  # add more arguments for the callback
    yield request

def parse_page2(self, response, main_url, foo):
    yield dict(
        main_url=main_url,
        other_url=response.url,
        foo=foo,
    )

source: https://docs.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments
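Applied to the question's spider, the same pattern would look roughly like this (a sketch assuming Scrapy ≥ 1.7; `HeroItem` and the field assignments are as in the question):

def parse(self, response):
    for sel in response.xpath('//tbody/tr'):
        item = HeroItem()
        # ... populate the table fields as in the question ...
        url = 'https://' + item['server'] + '.battle.net/' + sel.xpath('td[@class="cell-BattleTag"]//a/@href').extract()[0].strip()
        # each key in cb_kwargs becomes a keyword argument of the callback
        yield scrapy.Request(url, callback=self.parse_profile, cb_kwargs={'item': item})

def parse_profile(self, response, item):
    item['weapon'] = response.xpath('//li[@class="slot-mainHand"]/a[@class="slot-link"]/@href').extract()[0].split('/')[4]
    yield item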

penduDev
-1

I had a similar issue with passing extra arguments in Tkinter, and found this solution to work (here: http://infohost.nmt.edu/tcc/help/pubs/tkinter/web/extra-args.html), converted to your problem:

def parse(self, response):
    item = HeroItem()
    [...]
    def handler(self = self, response = response, item = item):
        """ passing as default argument values """
        return self.parse_profile(response, item)
    yield Request(url, callback=handler)
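Note that default argument values are evaluated once, when the `def` statement executes, so each `handler` keeps the `item` it was defined with. A quick standalone check of that binding behaviour:

# Each handler captures the value i had when its `def` ran:
callbacks = []
for i in range(3):
    def handler(i=i):  # default evaluated now, at definition time
        return i
    callbacks.append(handler)

print([cb() for cb in callbacks])  # -> [0, 1, 2], not [2, 2, 2]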
rolika
    This is a dangerous suggestion. He's looping through all of the "items" found in `response.xpath('//tbody/tr')`. Since the Request will not provide an item as a parameter in the callback (ever), the handler method will always use item as the default. Unfortunately, item will be whatever it is *at the time the callback call is made* not what it was at the time the Request is yielded. Your collected data will be unreliable and inconsistent. – Rejected Aug 27 '15 at 15:58
  • @Rejected No, by assigning the variables in the function header (`self=self`...) it holds the values of the variables at the time the `handler` function definition is executed. As long as the definition of `handler` is inside the loop, `parse_profile` will get the values of each item being iterated over. – Alan Hoover Aug 27 '15 at 16:04
  • This is a nicely elegant solution. – Alan Hoover Aug 27 '15 at 16:08
  • @AlanHoover I was under the impression that since the callback on a Request can happen later, the function itself is redefined, and the redefined function is called when the callback is executed. I recall running into that myself, and I'm pretty sure that I wasn't doing any late binding of any parameters. I'll do some tests! – Rejected Aug 27 '15 at 17:02
-2

@penduDev

I tried your approach, but it failed due to an unexpected keyword argument.

scrapy_req = scrapy.Request(url=url,
                            callback=self.parseDetailPage,
                            cb_kwargs=dict(participant_id=nParticipantId))


def parseDetailPage(self, response, participant_id):
    # .. some code here ..
    yield MyParseResult(
        # .. some code here ..
        participant_id=participant_id
    )

Error reported:

, cb_kwargs=dict(participant_id=nParticipantId)
TypeError: __init__() got an unexpected keyword argument 'cb_kwargs'

Any idea what caused the unexpected keyword argument, other than perhaps a too-old Scrapy version?

Yep. I verified my own suggestion, and after an upgrade it all worked as expected.

sudo pip install --upgrade scrapy
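If you want to confirm the installed version first (a quick check; `cb_kwargs` requires Scrapy 1.7+):

import scrapy
print(scrapy.__version__)  # needs to be >= 1.7 for cb_kwargs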

Jan