I'm halfway through writing a scraper with Scrapy and am worried that its asynchronous behaviour may result in issues.

I start on a page that has several links a, and from each of them I get an x. These x are saved (downloaded). Then I go to another page b, where I use some info I got from one of the a links (it is constant for all of them) to select and download y.

Then I "pair" x and y; how I pair them is not important, what matters is just that x and y both exist (are downloaded).

Now I would consider my starting page (start_urls) processed, and I would get the link to 'turn' the page (as in: I'm on page 1 and am now going to page 2), which I then Request to start the process from the beginning.
The code looks roughly like this:
# ..imports, class etc.

start_urls = ['bla']
start_url_buddy = 'bli'

def parse(self, response):
    urls = response.xpath(...)
    for url in urls:
        yield scrapy.Request(url, callback=self.parse_child)

    yield scrapy.Request(self.start_url_buddy, callback=self.parse_buddy)

    pair_function(self.info)

    # Finished processing start page. Now turning the page.
    # Could do something like this to get the next page:
    nextpage_url = response.xpath(...@href)
    yield scrapy.Request(nextpage_url)
    # Or maybe something like this?
    self.start_urls.append(response.xpath(...@href))

# links `a`
def parse_child(self, response):
    # info for processing link `b`
    self.info = response.xpath(...)
    # download link
    x = response.xpath(...)
    # urlopen etc., write x to a file in a central dir

# link `b`
def parse_buddy(self, response):
    # download link
    y = response.xpath(...self.info...)
    # urlopen etc., write y to a file in a central dir
I haven't gotten to the page-turning part yet and am worried whether that will work as intended (I'm fiddling with the merge function atm; getting the xs and ys works fine for one page). I don't care in what order the xs and ys are gotten, as long as it happens before pair_function and before 'turning the page' (when the parse function should be called again).
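To illustrate the ordering I'm after, here's a rough sketch of how I imagine the requests could be chained so that the pairing only happens once both files exist. This isn't my actual code: the URLs, xpaths and the meta key are placeholders, and it restructures things a bit (one buddy request per x instead of one per start page), so I don't know if it's the idiomatic way to do it:

import scrapy


class PairSketchSpider(scrapy.Spider):
    # Sketch only -- all names, URLs and xpaths below are placeholders.
    name = 'pair_sketch'
    start_urls = ['http://example.com/page1']   # 'bla'
    buddy_url = 'http://example.com/buddy'      # 'bli'

    def parse(self, response):
        # One chain per `a` link: each chain downloads its x, then requests
        # the buddy page, downloads y, and pairs the two.
        for url in response.xpath("//a[@class='child']/@href").getall():
            yield response.follow(url, callback=self.parse_child)

    def parse_child(self, response):
        info = response.xpath("//span[@id='info']/text()").get()
        x = response.xpath("//a[@id='x']/@href").get()
        # ...download/save x here...
        # Only after x has been handled is the buddy page requested; the info
        # travels along in meta instead of being shared on self.
        yield scrapy.Request(
            self.buddy_url,
            callback=self.parse_buddy,
            meta={'info': info, 'x': x},
            dont_filter=True,  # the same buddy URL is requested for every chain
        )

    def parse_buddy(self, response):
        info = response.meta['info']
        y = response.xpath(f"//a[@data-key='{info}']/@href").get()
        # ...download/save y here...
        # Both x and y exist by the time this runs, because this callback is
        # only reached after parse_child has done its part, so pairing is safe.

That at least expresses the dependency I care about, but it doesn't solve the page-turning problem.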
I have looked at a couple of other SO questions like this one, but I haven't been able to get a clear answer from them. My basic problem is that I'm unsure how exactly the asynchronicity is implemented (it doesn't seem to be explained in the docs?).
EDIT: To be clear, what I'm scared will happen is that yield scrapy.Request(nextpage_url) will be called before the previous ones have gone through. I'm now thinking I can maybe safeguard against that by just appending to start_urls (as I've done in the code) after everything has been done (the logic being that this should result in the parse function being called on the appended url?).
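If appending to start_urls isn't the right way to do that, the kind of guard I have in mind would look something like this. Again just a sketch with made-up xpaths; the pending counter is only an idea, not something I know to be idiomatic (and a failed child request would obviously stall it):

import scrapy


class GuardSketchSpider(scrapy.Spider):
    # Sketch of the safeguard idea: only yield the next-page request once
    # every child response for the current page has come back and been handled.
    # All names and xpaths are placeholders.
    name = 'guard_sketch'
    start_urls = ['http://example.com/page1']

    def parse(self, response):
        child_urls = response.xpath("//a[@class='child']/@href").getall()
        self.pending = len(child_urls)   # children still outstanding for this page
        self.next_page = response.xpath("//a[@rel='next']/@href").get()
        for url in child_urls:
            yield response.follow(url, callback=self.parse_child)

    def parse_child(self, response):
        # ...handle x here; in the real spider the chain continues to the buddy
        # page, and self.pending would be decremented in the last callback of
        # that chain instead of here...
        self.pending -= 1
        if self.pending == 0 and self.next_page:
            # Everything on this page has gone through, so it should now be
            # safe to 'turn the page' and start over in parse().
            yield response.follow(self.next_page, callback=self.parse)

Would that work, or is there a built-in way to express "do this only after all these requests have finished"?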