
I have to scrape something where part of the information is on one page, and then there's a link on that page that contains more information and then another url where the 3rd piece of information is available.

How do I go about setting up my callbacks in order to have all this information together? Will I have to use a database in this case or can it still be exported to CSV?

Crypto
  • Possible duplicate of [how do i merge results from target page to current page in scrapy?](https://stackoverflow.com/questions/8467700/how-do-i-merge-results-from-target-page-to-current-page-in-scrapy) – gdvalderrama May 10 '19 at 12:27

1 Answer


The first thing to say is that you have the right idea - callbacks are the solution. I have seen some use of urllib or similar to fetch dependent pages, but it's far preferable to leverage Scrapy's download machinery fully rather than make a synchronous call with another library.

See this example from the Scrapy docs on the issue: http://doc.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions

from scrapy import Request

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    # parse response and populate item as required
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    # attach the partially-built item so the next callback can finish it
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    # parse response and populate item as required
    item['other_url'] = response.url
    return item

Is your third piece of data on a page linked from the first page or the second page?

If from the second page, you can just extend the mechanism above and have parse_page2 return a request with a callback to a new parse_page3.
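
A minimal sketch of that chain, assuming the third link can be extracted from the second page - the XPath and the third_url field here are placeholders, not anything your site actually uses:

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    # placeholder XPath - replace with a selector for the third link on page 2
    link3 = response.xpath('//a[@class="third"]/@href').get()
    request = Request(response.urljoin(link3),
                      callback=self.parse_page3)
    request.meta['item'] = item
    return request

def parse_page3(self, response):
    item = response.meta['item']
    # the fully populated item is only returned at the end of the chain
    item['third_url'] = response.url
    return item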

If from the first page, you could have parse_page1 populate a request.meta['link3_url'] entry, from which parse_page2 can construct the URL for the third request.
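
A sketch of that first-page variant, reusing the parse_page3 callback above - again, the XPaths and the link3_url meta key are assumed names for illustration:

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    # placeholder XPaths - both links are extracted from the first page
    page2_url = response.xpath('//a[@class="second"]/@href').get()
    link3_url = response.xpath('//a[@class="third"]/@href').get()
    request = Request(response.urljoin(page2_url),
                      callback=self.parse_page2)
    request.meta['item'] = item
    # carry the third URL along so parse_page2 can issue the next request
    request.meta['link3_url'] = response.urljoin(link3_url)
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    request = Request(response.meta['link3_url'],
                      callback=self.parse_page3)
    request.meta['item'] = item
    return request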

NB: these 'secondary' and 'tertiary' URLs should not be discoverable by the normal crawling process (start_urls and rules), but should be constructed from the response (using XPath, etc.) in parse_page1/parse_page2.

The crawling, callback structure, pipelines and item construction are all independent of how the data is exported, so CSV export works just as it would for a single-page spider.
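
For example, assuming your spider is named myspider, Scrapy's built-in feed exports will write the finished items straight to a file:

scrapy crawl myspider -o items.csv

Since the item is only returned from the last callback in the chain, each CSV row contains the fields collected across all three pages.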

psion5mx
  • What about if I want to send multiple requests from parse_page1? Do I just return a list of requests? – Crypto Jan 13 '14 at 12:52