I am using Python and Scrapy for this question.
I am attempting to crawl webpage A, which contains a list of links to webpages B1, B2, B3, .... Each Bn page contains a link to a further page Cn, which contains an image.
So, using Scrapy, the idea in pseudo-code is:
    links = getlinks(A)
    for link in links:
        B = getpage(link)
        C = getpage(getlink(B))   # each B page contains the link to its C page
        image = getimage(C)
However, I am running into a problem when trying to parse more than one page deep in Scrapy. Here is my code so far:
    from scrapy.selector import HtmlXPathSelector

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select('...')
        items = []
        for link in links:
            item = CustomItem()
            item['name'] = link.select('...').extract()
            # TODO: Somehow I need to go two pages deep and extract an image.
            item['image'] = ...
            items.append(item)
        return items
How would I go about doing this?
(Note: My question is similar to "Using multiple spiders at in the project in Scrapy", but I am unsure how to "return" values from Scrapy's Request objects.)
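For what it's worth, here is a rough sketch of the kind of callback chaining I imagine might work, pieced together from the Request documentation. The meta={'item': ...} trick and the parse_b/parse_c callback names are my own guesses, and I have not gotten this to run:

    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for link in hxs.select('...'):
            item = CustomItem()
            item['name'] = link.select('...').extract()
            # Guess: carry the half-built item along to the next request.
            yield Request(link.select('.../@href').extract()[0],
                          callback=self.parse_b, meta={'item': item})

    def parse_b(self, response):
        # Page B: find the link to page C and keep forwarding the item.
        hxs = HtmlXPathSelector(response)
        url_c = hxs.select('.../@href').extract()[0]
        yield Request(url_c, callback=self.parse_c,
                      meta={'item': response.meta['item']})

    def parse_c(self, response):
        # Page C: fill in the image and emit the finished item.
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item['image'] = hxs.select('...').extract()
        yield item

Is passing the item through meta like this the intended approach, or is there a proper way to get a value back out of a Request?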