
Disclaimer: I'm fairly new to Scrapy.

To put my question plainly: How can I retrieve an Item property from a link on a page and get the results back into the same Item?

Given the following sample Spider:

class SiteSpider(Spider):
    site_loader = SiteLoader
    ...
    def parse(self, response):
        item = Place()
        sel = Selector(response)
        bl = self.site_loader(item=item, selector=sel)
        bl.add_value('domain', self.parent_domain)
        bl.add_value('origin', response.url)
        for place_property in item.fields:
            parse_xpath = self.template.get(place_property)

            # parse_xpath will look like either:
            # '//path/to/property/text()'
            # or
            # {'url_elem': '//a[@id="Location"]/@href',
            #  'xpath': '//div[@class="directions"]/span[@class="address"]/text()'}
            if isinstance(parse_xpath, dict):  # place_property is at a URL
                # extract() returns a list; Request needs a single URL string
                url = sel.xpath(parse_xpath['url_elem']).extract()[0]
                yield Request(url, callback=self.get_url_property,
                              meta={'loader': bl, 'parse_xpath': parse_xpath,
                                    'place_property': place_property})
            else:  # parse_xpath is just an xpath; process normally
                bl.add_xpath(place_property, parse_xpath)
        yield bl.load_item()

    def get_url_property(self, response):
        loader = response.meta['loader']
        parse_xpath = response.meta['parse_xpath']
        place_property = response.meta['place_property']
        sel = Selector(response)
        loader.add_value(place_property, sel.xpath(parse_xpath['xpath']).extract())
        return loader

I'm running these spiders against multiple sites, and most of them have the data I need on one page, where everything works just fine. However, some sites keep certain properties on a sub-page (e.g., the "address" data lives behind the "Get Directions" link).

The "yield Request" line is really where I have the problem. I see the items move through the pipeline, but they're missing those properties that are found at other URLs (IOW, those properties that get to "yield Request"). The get_url_property callback is basically just looking for an xpath within the new response variable, and adding that to the item loader instance.

Is there a way to do what I'm looking for, or is there a better way? I would like to avoid making a synchronous call to get the data I need (if that's even possible here), but if that's the best way, then maybe that's the right approach. Thanks.

JoeLinux
  • parse() shouldn't yield an item if it isn't fully filled. Rather, the partially filled item should be passed to get_url_property, and it should be returned/yielded from there. See also: https://stackoverflow.com/questions/9334522/scrapy-follow-link-to-get-additional-item-data/22011753#22011753 – Jan Wrobel Mar 05 '14 at 16:26
  • I understand that, but how can I get a Request object to resolve and process the callback without yielding from parse()? I also can't guarantee that every Item will involve requesting other URLs. Most of them won't. – JoeLinux Mar 05 '14 at 17:55
  • Maybe you should have a look at CrawlSpider. You can then setup some rules for how the spider should handle different links by adding a callback function. – Bj Blazkowicz Mar 06 '14 at 09:09
  • I tried CrawlSpider, but similar rules apply. Items are still treated separately, and I have to pass a single item down the chain across multiple URLs on some occasions. Either way, I read a comment that @JanWrobel posted elsewhere that gave me an idea. Here's to hoping... – JoeLinux Mar 06 '14 at 13:12
  • Consider this answer: https://stackoverflow.com/a/45498623/3140273 – TechWisdom Oct 07 '20 at 13:54

1 Answer


If I understand you correctly, you have (at least) two different cases:

  1. The crawled page links to another page containing the data (1+ further request necessary)
  2. The crawled page contains the data (No further request necessary)

In your current code, you call yield bl.load_item() for both cases, and you do it in the parse callback. Note that the request you yield is executed at some later point in time, so the item is still incomplete when you yield it; that's why the place_property key is missing from the item in the first case.

Possible Solution

A possible solution (if I understood you correctly) is to exploit the asynchronous behavior of Scrapy. Only minor changes to your code are needed.

For the first case, you pass the item loader to another request, which will then yield it. This is what you do in the isinstance if clause. You'll need to change the return value of the get_url_property callback to actually yield the loaded item.

For the second case, the data is already at hand, so you can simply yield the loaded item in the else clause.

The following code contains the changes to your example. Does this solve your problem?

def parse(self, response):

    # ...

    if isinstance(parse_xpath, dict):  # place_property is at a URL
        # extract() returns a list; take the single URL string
        url = sel.xpath(parse_xpath['url_elem']).extract()[0]
        yield Request(url, callback=self.get_url_property,
                      meta={'loader': bl, 'parse_xpath': parse_xpath,
                            'place_property': place_property})
    else:  # parse_xpath is just an xpath; process normally
        bl.add_xpath(place_property, parse_xpath)
        yield bl.load_item()

def get_url_property(self, response):

    loader = response.meta['loader']
    # ...
    loader.add_value(place_property, sel.xpath(parse_xpath['xpath']).extract())
    yield loader.load_item()

Related to that problem is the question of chaining requests, for which I have noted a similar solution.
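
For completeness, here's a minimal sketch of how such a chain could look when a single item needs several sub-pages. This is only one way to do it; the url_fields() helper and the populate_field callback are hypothetical names, not part of Scrapy's API. The idea is that each request carries the remaining requests along in meta, and only the last callback in the chain yields the finished item:

from scrapy.http import Request
from scrapy.selector import Selector

# Both methods below live on your SiteSpider.
def parse(self, response):
    loader = self.site_loader(item=Place(), selector=Selector(response))
    # url_fields() is a hypothetical helper that yields
    # (place_property, url, xpath) triples for every property
    # that lives on a sub-page.
    pending = [Request(url, callback=self.populate_field,
                       meta={'loader': loader, 'xpath': xpath,
                             'place_property': place_property})
               for place_property, url, xpath in self.url_fields(response)]
    if not pending:
        # Everything was on this page; the item is already complete.
        yield loader.load_item()
    else:
        # Start the chain: hand the remaining requests to the first one.
        request = pending.pop(0)
        request.meta['pending'] = pending
        yield request

def populate_field(self, response):
    loader = response.meta['loader']
    sel = Selector(response)
    loader.add_value(response.meta['place_property'],
                     sel.xpath(response.meta['xpath']).extract())
    pending = response.meta['pending']
    if pending:
        # More sub-pages to visit; pass the rest of the chain along.
        request = pending.pop(0)
        request.meta['pending'] = pending
        yield request
    else:
        # Last stop in the chain: the item is complete now.
        yield loader.load_item()

Since nothing is yielded until the chain is exhausted, no incomplete item ever reaches the pipeline, and everything stays asynchronous.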

oliverguenther