2

Perhaps yield in Python is remedial for some, but not for me... at least not yet. I understand yield creates a 'generator'.

I stumbled upon yield when I decided to learn scrapy. I wrote some code for a Spider which works as follows:

  1. Goes to the start hyperlink and extracts all hyperlinks - which are not full hyperlinks, just sub-directories to be concatenated onto the starting hyperlink
  2. Examines the hyperlinks and appends those meeting specific criteria to the base hyperlink
  3. Uses Request to navigate to the new hyperlink and parses it to find a unique id in an element with 'onclick'
import scrapy
from scrapy import Request

class newSpider(scrapy.Spider):
    name = 'new'
    allowed_domains = ['www.alloweddomain.com']
    start_urls = ['https://www.alloweddomain.com']

    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        for link in links:
            if link == 'SpecificCriteria':
                next_link = response.urljoin(link)
                yield Request(next_link, callback=self.parse_new)

EDIT 1:

                for uid_dict in self.parse_new(response):
                    print(uid_dict['uid'])
                    break

End EDIT 1

Running the code here evaluates response as the HTTP response to start_urls and not to next_link.

    def parse_new(self, response):
        trs = response.xpath("//*[@class='unit-directory-row']").getall()
        for tr in trs:
            if 'SpecificText' in tr:
                elements = tr.split()
                for element in elements:
                    if 'onclick' in element:
                        subelement = element.split('(')[1]
                        uid = subelement.split(')')[0]
                        print(uid)
                        yield {
                            'uid': uid
                        }
                break

It works: scrapy crawls the first page, creates the new hyperlink and navigates to the next page. parse_new parses the HTML for the uid and 'yields' it. scrapy's engine shows that the correct uid is 'yielded'.

What I don't understand is how I can 'use' the uid obtained by parse_new, like I would a variable, to create and navigate to a new hyperlink. I cannot seem to return a variable with Request.

M.Sqrl
    You need to iterate on what is returned by the method – azro May 12 '20 at 19:43
    Does this answer your question? [How to use yield function in python](https://stackoverflow.com/questions/42428310/how-to-use-yield-function-in-python) – Diggy. May 12 '20 at 19:44

2 Answers

0

I'd check out What does the "yield" keyword do? for a good explanation of how exactly yield works.

In the meantime, spider.parse_new(response) is an iterable object. That is, you can acquire its yielded results via a for loop. E.g.,

for uid_dict in spider.parse_new(response):
    print(uid_dict['uid'])
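
To make the mechanics concrete, here is a dependency-free sketch (no scrapy) that mimics parse_new's string-splitting; the sample rows are made-up stand-ins for illustration:

```python
def parse_items(rows):
    # Mimics parse_new: yield one dict per row containing 'onclick'.
    for row in rows:
        if 'onclick' in row:
            uid = row.split('(')[1].split(')')[0]
            yield {'uid': uid}

gen = parse_items(["<td onclick=viewUnit(42)>", "<td>plain</td>"])
for uid_dict in gen:
    print(uid_dict['uid'])  # prints 42

# A generator is exhausted after one full pass:
print(list(gen))  # prints []
```

Calling `parse_items(...)` again returns a fresh generator; the exhausted one yields nothing more, which matters for the single-iteration point raised in the comments.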
Daniel Walker
    I've read several articles about `yield`. Coming from more of a VBA background, my brain hurts thinking about it, but I will adjust. If I understand correctly, `yield` creates a 'generator' in memory rather than storing a set of generated values. I imagine a generator as a function that can be called upon at any time. Need to finish reading that reference. Seems pretty good. – M.Sqrl May 12 '20 at 19:51
  • “called upon at any time” ← This bit is not accurate, actually. A generator can only be iterated once, which is important to understand to avoid future headaches. – Gallaecio May 12 '20 at 21:11
  • I think he meant that you can call `next` any time. – Daniel Walker May 12 '20 at 22:30
  • Still lost. It seems like `yield Request(new_link, callback=self.parse_new)` creates `parse_new()` as a generator. So what exactly is `response`? In your code I presume `uid_dict` is treated as a variable returned from the generator. `response` is being passed to the generator to evaluate and return `uid_dict`. But when I step through the code `response` continues as the response to the original `start_urls` request. So when and how does `response` get assigned to the `new_link` request? – M.Sqrl May 13 '20 at 20:24
  • Back to the quotesbotSpider... `def parse(self, response): for quote in response.xpath(''): print(quote) yield { 'text': quote.xpath('xref').extract_first(), 'author': quote.xpath('xref').extract_first(), 'tags': quote.xpath('xref').extract() } print(quote['text'])` print(quote) works, print(quote['text']) does not. `'Selector' object is not subscriptable.` I have no idea how to access the generated dictionary. – M.Sqrl May 14 '20 at 13:37
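
On the "when does response get assigned" question in this thread: scrapy's engine sends the yielded Request over the network and only then calls the callback with the *new* response; data can ride along on the Request's meta dict (scrapy also offers a cb_kwargs argument for the same purpose). Below is a dependency-free sketch of that hand-off; FakeRequest, FakeResponse, and the URLs are hypothetical stand-ins, not scrapy's real classes:

```python
class FakeRequest:
    # Stand-in for scrapy.Request: a URL, a callback, and a meta dict.
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}

class FakeResponse:
    # Stand-in for scrapy's Response: carries the request's meta along.
    def __init__(self, request):
        self.meta = request.meta
        self.url = request.url

def parse(response):
    uid = '12345'  # pretend this was scraped from the first page
    yield FakeRequest('https://www.alloweddomain.com/unit/' + uid,
                      callback=parse_unit, meta={'uid': uid})

def parse_unit(response):
    # The uid travelled with the request and is available here.
    yield {'uid': response.meta['uid'], 'url': response.url}

# Minimal "engine": take the request parse yields, fake the new
# response for its URL, then invoke the callback with it.
start = FakeRequest('https://www.alloweddomain.com', parse)
request = next(parse(FakeResponse(start)))
item = next(request.callback(FakeResponse(request)))
print(item)  # prints {'uid': '12345', 'url': 'https://www.alloweddomain.com/unit/12345'}
```

In real scrapy the equivalent is `yield scrapy.Request(next_link, callback=self.parse_new, meta={'uid': uid})` in one callback and `response.meta['uid']` in the next; you never call `parse_new` yourself, the engine does.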
0

After much reading and learning I discovered the reason scrapy does not perform the callback in the first parse, and it has nothing to do with yield! It comes down to two issues:

1) robots.txt. This can be 'resolved' with ROBOTSTXT_OBEY = False in settings.py.

2) The logger shows "Filtered offsite request to ...". Passing dont_filter=True to the Request may resolve this.
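
A sketch of where those two fixes live (ROBOTSTXT_OBEY and dont_filter are real scrapy knobs; the surrounding project layout is assumed):

```python
# settings.py -- project-wide scrapy settings
ROBOTSTXT_OBEY = False  # 1) stop honouring robots.txt (weigh the site's terms first)

# 2) In the spider callback, bypass the offsite/duplicate filter per request:
#        yield Request(next_link, callback=self.parse_new, dont_filter=True)
```

Note the offsite filtering usually means the target URL's domain does not match allowed_domains, so tightening that list is an alternative to dont_filter.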

M.Sqrl