
I'm halfway through writing a scraper with Scrapy and am worried that its asynchronous behaviour may result in issues.

I start on a page that has several links `a`, from each of which I get an `x`. These `x`s are saved (downloaded). Then I go to another page `b`, where I use some info I got from one of the `a` links (it is constant for all of them) to select and download `y`.
Then I "pair" `x` and `y`. How I pair them is not important; what matters is just that `x` and `y` both exist (are downloaded).
Now I would consider my starting page (start_urls) processed, and I would get the link to 'turn' the page (as in I'm on page 1 and am now going to page 2), which I then Request to start the process from the beginning.

The code looks roughly like this:

# ..imports, class etc.

start_urls = ['bla']
start_url_buddy = 'bli'


def parse(self, response):

    urls = response.xpath(...)
    for url in urls:
        yield scrapy.Request(url, callback=self.parse_child)

    yield scrapy.Request(self.start_url_buddy, callback=self.parse_buddy)

    pair_function(self.info)

    # Finished processing start page. Now turning the page.
    # could do smth like this to get next page:
    nextpage_url = response.xpath(...@href)
    yield scrapy.Request(nextpage_url)

    # or maybe something like this?
    self.start_urls.append(response.xpath(...@href))

# links `a`
def parse_child(self, response):

    # info for processing link `b`
    self.info = response.xpath(...)

    # Download link
    x = response.xpath(...)
    # urlopen etc. write x to file in central dir

# link `b`
def parse_buddy(self, response):

    # Download link
    y = response.xpath(...self.info...)
    # urlopen etc. write y to file in central dir

I haven't gotten to the page-turning part yet and am worried about whether it will work as intended (I'm fiddling with the pair function at the moment; getting the xs and the y works fine for one page). I don't care in what order the xs and the y are fetched, as long as it happens before pair_function and before 'turning the page' (which is when the parse function should be called again).

I have looked at a couple of other SO questions like this one, but I haven't been able to get a clear answer from them. My basic problem is that I'm unsure how exactly the asynchronicity is implemented (it doesn't seem to be explained in the docs?).

EDIT: To be clear, what I'm scared will happen is that yield scrapy.Request(nextpage_url) will be called before the previous ones have gone through. I'm now thinking I can maybe safeguard against that by just appending to start_urls (as I've done in the code) after everything has been done (the logic being that this should result in the parse function being called on the appended url?).


1 Answer


You won't be able to know when a request is finished: Scrapy processes all your requests, but it doesn't wait for the requested server to return a response before moving on to the next pending request.

With asynchronous calls you don't know "when" they will end, but you do know "where", and that's what the callback method is for. So, for example, if you want to be sure one request is made after another, you can do something like:

def callback_method_1(self, response):
    # working with response 1
    yield Request(url2, callback=self.callback_method_2)

def callback_method_2(self, response):
    # working with response 2, after response 1
    yield Request(url3, callback=self.callback_method_3)

def callback_method_3(self, response):
    # working with response 3, after response 2 
    yield myfinalitem

In this example you know for sure that the first request was done before the url2 request, and that one before the url3 request. As you can see, you don't know exactly "when" these requests were done, but you do know "where".

Also remember that one way to communicate between callbacks is the `meta` request argument.
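
For example, a minimal sketch of handing a value from one callback to the next with `meta` (the URL and selector here are just placeholders, not the asker's actual ones):

from scrapy import Request

def parse(self, response):
    # placeholder selector; extract whatever piece of info the next callback needs
    info = response.xpath('//a/@href').extract_first()
    # pass it along with the request instead of storing it on self
    yield Request('http://example.com/buddy', callback=self.parse_buddy,
                  meta={'info': info})

def parse_buddy(self, response):
    info = response.meta['info']  # the value set by the callback that made this request
    # ... use info to select and download y ...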

eLRuLL
  • It doesn't matter to me in what order the first `Request`s are made (the urls `url` in `urls`). The only thing that matters to me is that these are all finished before the last one, `yield scrapy.Request(nextpage_url)`, is made. – Nimitz14 Dec 27 '16 at 22:41
  • I only explained to you how scrapy works. If you send multiple requests from one method (like your `parse` method), all those requests are handled asynchronously, so it doesn't matter on which line of the code each one appears; you'll have to accommodate your code to this idea of callbacks. – eLRuLL Dec 27 '16 at 22:44
  • Ah, sorry, thank you. So could I "hide" a request by putting it in a separate function? If I had a `parse` function with three `yield` requests and I put the last one inside a separate function, would that guarantee the first two would be processed before the last one? – Nimitz14 Dec 27 '16 at 22:53
  • you could set the `callback` method for both requests to the same method, and inside that method check whether it was the "second" time it was called (with a meta parameter, or even a class counter attribute) and then call the required method; see the sketch below. – eLRuLL Dec 27 '16 at 23:01
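
Very roughly, the counter idea from that last comment, applied to firing the next-page request only after all the per-page requests have come back, might look like this (the selectors, URLs and the `pending`/`next_page` attribute names are invented for illustration, and passing the info needed for the buddy page, e.g. via `meta`, is left out):

import scrapy

class PagedSpider(scrapy.Spider):
    name = 'paged'
    start_urls = ['http://example.com/page1']      # placeholder
    start_url_buddy = 'http://example.com/buddy'   # placeholder

    def parse(self, response):
        child_urls = response.xpath('//a[@class="child"]/@href').extract()   # placeholder selector
        self.next_page = response.xpath('//a[@rel="next"]/@href').extract_first()
        # one pending response per child link, plus one for the buddy page
        self.pending = len(child_urls) + 1
        for url in child_urls:
            yield scrapy.Request(response.urljoin(url), callback=self.collect)
        yield scrapy.Request(self.start_url_buddy, callback=self.collect)

    def collect(self, response):
        # ... save x or y from this response ...
        self.pending -= 1
        if self.pending == 0:
            # every child and the buddy page have been handled;
            # now it is safe to pair x and y and turn the page
            if self.next_page:
                yield scrapy.Request(response.urljoin(self.next_page), callback=self.parse)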