
When I write a parse() function, can I yield both a request and items for a single page?

I want to extract some data from page A, store the data in a database, and also extract links to be followed (this can be done with a rule in CrawlSpider).

Let's call the pages linked from the A pages the B pages. I can write another parse_item() to extract data from the B pages, but I also want to extract some links from the B pages. Is a rule the only way to extract links? And how do I handle duplicate URLs in Scrapy?

kuafu
I am not sure I understand your problem. Maybe this is related: http://stackoverflow.com/q/9334522/248296 – warvariuc Jan 01 '13 at 17:08

3 Answers


Yes, you can yield both requests and items. From what I've seen:

from urlparse import urljoin  # Python 2; use urllib.parse.urljoin on Python 3

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector


def parse(self, response):
    hxs = HtmlXPathSelector(response)
    base_url = response.url
    # self.toc_xpath is assumed to be an XPath string defined on the spider.
    links = hxs.select(self.toc_xpath)

    for index, link in enumerate(links):
        href, text = link.select('@href').extract(), link.select('text()').extract()
        # Yield one follow-up request per extracted link.
        yield Request(urljoin(base_url, href[0]), callback=self.parse2)

    # Also run parse2 over the current page and yield the items it produces.
    for item in self.parse2(response):
        yield item
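
For reference: in current Scrapy versions, HtmlXPathSelector is gone and you call response.xpath directly, but mixing item and request yields works the same way. A minimal sketch under that assumption; the spider name, start URL, XPaths, and the item dict are all placeholders:

import scrapy

class MixedSpider(scrapy.Spider):
    name = "mixed"
    start_urls = ["http://example.com/"]  # placeholder start page

    def parse(self, response):
        # Yield an item for data found on this page (a plain dict works as an item).
        yield {"title": response.xpath("//title/text()").get()}
        # Yield a follow-up request for every link on the page.
        for href in response.xpath("//a/@href").getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)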
Cacovsky
10

I'm not 100% sure I understand your question, but the code below requests pages from a starting URL using BaseSpider, scans the starting page for hrefs, and then loops over each link, calling parse_url(). Everything matched in parse_url() is sent to your item pipeline.

import urlparse  # Python 2 module; use urllib.parse on Python 3

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

from myproject.items import ZipgrabberItem  # adjust to your project's items module


def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # Only grab links whose href contains "content".
    urls = hxs.select('//a[contains(@href, "content")]/@href').extract()
    for i in urls:
        # i[1:] drops the leading slash before joining with the base URL.
        yield Request(urlparse.urljoin(response.url, i[1:]), callback=self.parse_url)


def parse_url(self, response):
    hxs = HtmlXPathSelector(response)
    item = ZipgrabberItem()
    # Grab the zip codes from the matching divs.
    item['zip'] = hxs.select("//div[contains(@class,'odd')]/text()").extract()
    return item
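
Since the returned item is handed to the item pipeline, here is a minimal sketch of a pipeline that would receive it (the class name and the logging call are placeholders; you would also need to enable the pipeline via ITEM_PIPELINES in settings.py):

class ZipgrabberPipeline(object):  # hypothetical pipeline class
    def process_item(self, item, spider):
        # item['zip'] is the list extracted in parse_url(); store or log it here.
        for zip_code in item['zip']:
            spider.log("got zip: %s" % zip_code)
        return item  # pass the item on to any later pipelines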
Chris Hawkes

From Steven Almeroth on Google Groups:

You are right, you can yield Requests and return a list of Items, but that is not what you are attempting. You are attempting to yield a list of Items instead of returning them. And since you are already using parse() as a generator function, you cannot have both yield and return together, but you can have many yields.
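
To see the restriction in isolation (plain Python 2, nothing Scrapy-specific): a generator function may contain any number of yields and a bare return, but return with a value is a SyntaxError:

def gen():
    yield 1
    yield 2        # any number of yields is fine
    return         # a bare return just ends the generator
    # return [3]   # SyntaxError in Python 2: 'return' with argument inside generator

print list(gen())  # prints [1, 2]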

Try this:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    base_url = response.url
    links = hxs.select(self.toc_xpath)

    for index, link in enumerate(links):
        href, text = link.select('@href').extract(), link.select('text()').extract()
        yield Request(urljoin(base_url, href[0]), callback=self.parse2)

    for item in self.parse2(response):
        yield item
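
As for the duplicate-URL part of the question: Scrapy's scheduler drops duplicate requests by default (using a request-fingerprint dupefilter), so yielding the same URL twice is normally harmless. If you actually want a URL re-crawled, pass dont_filter=True when building the request, for example:

yield Request(urljoin(base_url, href[0]),
              callback=self.parse2,
              dont_filter=True)  # bypass the built-in duplicate filter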
David Dehghan