
When I write a parse() function, can I yield both a request and items for a single page?

I want to extract some data from page A, store the data in a database, and also extract links to be followed (this can be done with a rule in CrawlSpider).

Let's call the pages linked from the A pages the B pages. I can write another parse_item() to extract data from the B pages, but I also want to extract some links from the B pages. Is a rule the only way to extract links? And how do I handle duplicate URLs in Scrapy?

kuafu
I am not sure I understand your problem. Maybe this is related: http://stackoverflow.com/q/9334522/248296 – warvariuc Jan 01 '13 at 17:08

3 Answers


Yes, you can yield both requests and items. From what I've seen:

from urlparse import urljoin  # Python 2; use urllib.parse.urljoin on Python 3

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector


def parse(self, response):
    hxs = HtmlXPathSelector(response)
    base_url = response.url
    # self.toc_xpath is assumed to be an XPath string defined on the spider.
    links = hxs.select(self.toc_xpath)

    for index, link in enumerate(links):
        href, text = link.select('@href').extract(), link.select('text()').extract()
        # Yield one follow-up request per extracted link.
        yield Request(urljoin(base_url, href[0]), callback=self.parse2)

    # Also run parse2 over the current page and yield the items it produces.
    for item in self.parse2(response):
        yield item
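
For reference: in current Scrapy versions, HtmlXPathSelector is gone and you call response.xpath directly, but mixing item and request yields works the same way. A minimal sketch under that assumption; the spider name, start URL, XPaths, and the item dict are all placeholders:

import scrapy

class MixedSpider(scrapy.Spider):
    name = "mixed"
    start_urls = ["http://example.com/"]  # placeholder start page

    def parse(self, response):
        # Yield an item for data found on this page (a plain dict works as an item).
        yield {"title": response.xpath("//title/text()").get()}
        # Yield a follow-up request for every link on the page.
        for href in response.xpath("//a/@href").getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)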
Cacovsky
10

I'm not 100% sure I understand your question, but the code below requests pages from a starting URL using BaseSpider, scans the starting page for hrefs, and then loops over each link, calling parse_url(). Everything matched in parse_url() is sent to your item pipeline.

import urlparse  # Python 2 module; use urllib.parse on Python 3

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

from myproject.items import ZipgrabberItem  # adjust to your project's items module


def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # Only grab links whose href contains "content".
    urls = hxs.select('//a[contains(@href, "content")]/@href').extract()
    for i in urls:
        # i[1:] drops the leading slash before joining with the base URL.
        yield Request(urlparse.urljoin(response.url, i[1:]), callback=self.parse_url)


def parse_url(self, response):
    hxs = HtmlXPathSelector(response)
    item = ZipgrabberItem()
    # Grab the zip codes from the matching divs.
    item['zip'] = hxs.select("//div[contains(@class,'odd')]/text()").extract()
    return item
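
Since the returned item is handed to the item pipeline, here is a minimal sketch of a pipeline that would receive it (the class name and the logging call are placeholders; you would also need to enable the pipeline via ITEM_PIPELINES in settings.py):

class ZipgrabberPipeline(object):  # hypothetical pipeline class
    def process_item(self, item, spider):
        # item['zip'] is the list extracted in parse_url(); store or log it here.
        for zip_code in item['zip']:
            spider.log("got zip: %s" % zip_code)
        return item  # pass the item on to any later pipelines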
Chris Hawkes

From Steven Almeroth on Google Groups:

You are right, you can yield Requests and return a list of Items, but that is not what you are attempting. You are attempting to yield a list of Items instead of returning them. And since you are already using parse() as a generator function, you cannot have both yield and return together, but you can have many yields.
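
To see the restriction in isolation (plain Python 2, nothing Scrapy-specific): a generator function may contain any number of yields and a bare return, but return with a value is a SyntaxError:

def gen():
    yield 1
    yield 2        # any number of yields is fine
    return         # a bare return just ends the generator
    # return [3]   # SyntaxError in Python 2: 'return' with argument inside generator

print list(gen())  # prints [1, 2]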

Try this:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    base_url = response.url
    links = hxs.select(self.toc_xpath)

    for index, link in enumerate(links):
        href, text = link.select('@href').extract(), link.select('text()').extract()
        yield Request(urljoin(base_url, href[0]), callback=self.parse2)

    for item in self.parse2(response):
        yield item
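
As for the duplicate-URL part of the question: Scrapy's scheduler drops duplicate requests by default (using a request-fingerprint dupefilter), so yielding the same URL twice is normally harmless. If you actually want a URL re-crawled, pass dont_filter=True when building the request, for example:

yield Request(urljoin(base_url, href[0]),
              callback=self.parse2,
              dont_filter=True)  # bypass the built-in duplicate filter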
David Dehghan