35

I don't have a specific code issue; I'm just not sure how to approach the following problem logistically with the Scrapy framework:

The structure of the data I want to scrape is typically a table row for each item. Straightforward enough, right?

Ultimately I want to scrape the Title, Due Date, and Details for each row. Title and Due Date are immediately available on the page...

BUT the Details themselves aren't in the table -- instead, each row has a link to the page containing the details (if that doesn't make sense, here's a table):

| Title                      | Due Date |
|----------------------------|----------|
| Job Title (Clickable Link) | 1/1/2012 |
| Other Job (Link)           | 3/2/2012 |

I'm afraid I still don't know how to logistically pass the item around with callbacks and requests, even after reading through the CrawlSpider section of the Scrapy documentation.

tshepang
dru
  • Possible duplicate of [how do i merge results from target page to current page in scrapy?](https://stackoverflow.com/questions/8467700/how-do-i-merge-results-from-target-page-to-current-page-in-scrapy) – gdvalderrama May 10 '19 at 12:27

3 Answers

28

Please read the docs first to understand what I say.

The answer:

To scrape additional fields that are on other pages: in a parse method, extract the URL of the page with the additional info, create a Request object with that URL and a callback, pass the already-extracted data via its meta parameter, and return that Request from the parse method.

[how do i merge results from target page to current page in scrapy?](https://stackoverflow.com/questions/8467700/how-do-i-merge-results-from-target-page-to-current-page-in-scrapy)
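Applied to the question's table scenario, that approach might look like the sketch below. The start URL, CSS selectors, and the structure of the details page are all assumptions about the target site; adjust them to the actual HTML.

    import scrapy

    class JobsSpider(scrapy.Spider):
        name = "jobs"
        start_urls = ["http://www.example.com/jobs"]  # placeholder URL

        def parse(self, response):
            # Each table row holds the Title (a link to the details page)
            # and the Due Date; the Details live on the linked page.
            for row in response.css("table tr"):
                details_url = row.css("td:nth-child(1) a::attr(href)").get()
                if details_url is None:
                    continue  # skip header rows without a link
                item = {
                    "title": row.css("td:nth-child(1) a::text").get(),
                    "due_date": row.css("td:nth-child(2)::text").get(),
                }
                # Hand the partially filled item to the next callback via meta
                yield scrapy.Request(response.urljoin(details_url),
                                     callback=self.parse_details,
                                     meta={"item": item})

        def parse_details(self, response):
            # Finish populating the item with data from the details page
            item = response.meta["item"]
            item["details"] = " ".join(response.css("body ::text").getall())
            yield item

Each row yields one Request carrying its own item, so the rows are scraped independently and in parallel.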

warvariuc
  • @fortuneRice, not sure if the examples are up to date: http://stackoverflow.com/questions/11150053 http://stackoverflow.com/questions/13910357/how-can-i-use-multiple-requests-and-pass-items-in-between-them-in-scrapy-python/13911764#13911764 – warvariuc Oct 22 '13 at 07:26
  • this is the relevant part of the docs: http://doc.scrapy.org/en/latest/topics/spiders.html – tback Mar 10 '14 at 16:37
  • Thanks @tback. OP's RTFD is just not a helpful way of formulating an answer. – kontur Feb 01 '19 at 08:50
25

An example from the Scrapy documentation:

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    # Request the second page, stashing the partially filled item in meta
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    # Retrieve the item passed from parse_page1 and finish populating it
    item = response.meta['item']
    item['other_url'] = response.url
    yield item
daaawx
Chitrasen
4

You can also use Python's functools.partial to pass an item, or any other serializable data, via additional arguments to the next Scrapy callback.

Something like:

import functools

from scrapy import Request

# Inside your Spider class:

def parse(self, response):
    # ...
    # Process the first response here, populate item and next_url.
    # ...
    # Bind item and someotherarg as leading arguments of parse_next;
    # Scrapy supplies response as the remaining argument.
    callback = functools.partial(self.parse_next, item, someotherarg)
    return Request(next_url, callback=callback)

def parse_next(self, item, someotherarg, response):
    # ...
    # Process the second response here.
    # ...
    return item
Jan Wrobel