Function in BaseSpider class to yield a request

Question

I'm trying to create a function that handles a recurring task in multiple spiders. The function yields a request, and that seems to break it. This question is a follow-up to this question.

import scrapy
import json
import re

class BaseSpider(scrapy.Spider):

    start_urls = {}

    def test(self, response, cb, xpath):
        self.logger.info('Success')
        for url in response.xpath(xpath).extract():
            req = scrapy.Request(response.urljoin(url), callback=cb)
            req.meta['category'] = response.meta.get('category')
            yield req

When yield req is in the code, the "Success" log message no longer appears and the callback function does not seem to be called. When yield req is commented out, the "Success" message is logged. Although I don't think the issue is in the spider, here is the spider's code:

# -*- coding: utf-8 -*-
import scrapy
from crawling.spiders import BaseSpider

class testContactsSpider(BaseSpider):
    """ Test spider """
    name = "test"
    start_urls = {}
    start_urls['test'] = 'http://www.thewatchobserver.fr/petites-annonces-montres#.WfMaIxO0Pm3'

    def parse(self,response):
        self.logger.info('Base page: %s', response.url)
        self.test(response, self.parse_page, '//h3/a/@href')

    def parse_page(self, response):
        self.logger.info('Page: %s', response.url)

Fidan · Accepted Answer · 2017-10-31T11:16:24.157

I think you need something like this:

    def parse(self, response):
        self.logger.info('Base page: %s', response.url)
        for req in self.test(response, self.parse_page, '//h3/a/@href'):
            yield req

The test method contains a yield, so calling it returns a generator object instead of running its body; the body only runs while the generator is being iterated. Try the code below and read this for Understanding Generators in Python:

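def test():
    print('Inside generator!')
    for i in range(5):
        yield i

print('============')
g = test()  # create a generator and keep a reference; nothing is printed yet
test()      # create another generator and discard it; the body never runs
print('============')
print(next(g))  # advance "g": runs the body up to the first yield
print(next(g))  # resume "g" until the next yield
print('============')
print(next(test()))  # first element of a newly created generator
print(next(test()))  # again a new generator, so again the first element
print('============')
for i in test():  # iterate over every element the generator yields
    print(i)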

In this example the generator is created but never used, so its body never runs:

self.test(response, self.parse_page, '//h3/a/@href')

Here we ask the generator for its next element, which is why the body runs (only up to the first yield):

self.test(response, self.parse_page, '//h3/a/@href').next()  # Python 2 only
# or, in Python 2.6+ and Python 3:
next(self.test(response, self.parse_page, '//h3/a/@href'))
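
Note that next() pulls only the first item out of the generator. The sketch below uses plain Python with no Scrapy; make_requests and the log list are hypothetical stand-ins for BaseSpider.test and self.logger. It illustrates that the body starts running only on the first next() call and that the remaining items still have to be consumed. In a spider, each request would also still have to be yielded or returned from parse for Scrapy to schedule it.

```python
log = []  # hypothetical stand-in for self.logger


def make_requests(urls):
    """Hypothetical stand-in for BaseSpider.test: yields one item per URL."""
    log.append("Success")  # runs only once the generator is first advanced
    for url in urls:
        yield "request:" + url


gen = make_requests(["/a", "/b", "/c"])
log_after_call = list(log)   # snapshot: the body has not started yet

first = next(gen)            # runs the body up to the first yield
log_after_next = list(log)   # snapshot: "Success" has now been logged

rest = list(gen)             # consume whatever the generator has left

print(log_after_call, first, log_after_next, rest)
```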

This does seem to work. However, ideally I'd like to keep my code as simple as possible. Is it not possible to have these requests yielded directly in the `self.test` function without looping in the `self.parse` function? — Casper, Oct 31 '17 at 14:02
I don't think that is possible. If you want to keep the code a little simpler (_in Python 3_), you can use `yield from test()` instead of `for req in test(): yield req`. — Fidan, Oct 31 '17 at 14:47
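
To illustrate the `yield from` suggestion, here is a minimal sketch in plain Python 3 with no Scrapy. make_requests, parse_with_loop and parse_with_yield_from are hypothetical names standing in for the test and parse methods. Under those assumptions, the explicit loop and the delegating form produce the same items.

```python
def make_requests(urls):
    """Hypothetical stand-in for BaseSpider.test: yields one item per URL."""
    for url in urls:
        yield "request:" + url


def parse_with_loop(urls):
    # Explicit loop: re-yield every item produced by the helper generator.
    for req in make_requests(urls):
        yield req


def parse_with_yield_from(urls):
    # Python 3.3+: delegate to the helper generator in a single statement.
    yield from make_requests(urls)


urls = ["/a", "/b"]
loop_result = list(parse_with_loop(urls))
delegated_result = list(parse_with_yield_from(urls))
print(loop_result == delegated_result)
```

In the spider itself this would correspond to writing `yield from self.test(response, self.parse_page, '//h3/a/@href')` inside parse.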
