
I am trying to scrape http://www.sueryder.org/Get-involved/Volunteering/All-Roles. As you can see, clicking through to the second page doesn't change the URL; the pagination is handled through JavaScript. I've been trying to use the Network tab in "inspect element", but I am completely lost. I did manage to scrape the first page of the website; here's the code.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy_demo.items import ScrapyDemoItem


class MySpider(BaseSpider):
    name = "test"
    allowed_domains = ["sueryder.org"]
    start_urls = ["http://www.sueryder.org/Get-involved/Volunteering/All-Roles"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//tr')  # one table row per volunteering role
        items = []
        for row in rows:
            item = ScrapyDemoItem()
            item["link"] = row.select('td/text()').extract()
            items.append(item)
        return items
user2927435

1 Answer


The JavaScript pagination links just call ASP.NET's __doPostBack() to submit a form, so use FormRequest:

from scrapy.http import FormRequest

# Inside your parse() callback. Each pagination href is typically a
# javascript:__doPostBack('target', 'argument') call, so the first quoted
# token is the __EVENTTARGET value the form expects.
for href in hxs.select('//div[@class="paging pag-num pag-arrows"]//a/@href').extract():
    target = href.split("'")[1]

    yield FormRequest.from_response(
        response=response,
        formnumber=0,
        formdata={'__EVENTTARGET': target}
    )

You'll also have to subclass CrawlSpider and set up a Rule to crawl the result pages, as yielding these requests from within parse will not work; see the sketch below.
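
A minimal sketch of how the pieces might fit together, using the same old-style Scrapy API as the question. It is untested against the live site; the restrict_xpaths value and the parse_result callback name are assumptions:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import FormRequest
from scrapy.selector import HtmlXPathSelector

from scrapy_demo.items import ScrapyDemoItem


class MySpider(CrawlSpider):
    name = "test"
    allowed_domains = ["sueryder.org"]
    start_urls = ["http://www.sueryder.org/Get-involved/Volunteering/All-Roles"]

    # Follow the links inside each page of results and hand the crawled
    # pages to parse_result. The restrict_xpaths value is a placeholder:
    # point it at the results table, not at the javascript: pagination
    # links, which a link extractor can't follow.
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths='//table'),
             callback='parse_result'),
    )

    def parse_start_url(self, response):
        # Re-submit the ASP.NET form once per pagination link. CrawlSpider's
        # built-in parse() then applies the rules to each returned page.
        hxs = HtmlXPathSelector(response)
        paging = '//div[@class="paging pag-num pag-arrows"]//a/@href'
        for href in hxs.select(paging).extract():
            target = href.split("'")[1]
            yield FormRequest.from_response(
                response,
                formnumber=0,
                formdata={'__EVENTTARGET': target})

    def parse_result(self, response):
        # Same row scraping as in the question, one item per table row.
        hxs = HtmlXPathSelector(response)
        for row in hxs.select('//tr'):
            item = ScrapyDemoItem()
            item["link"] = row.select('td/text()').extract()
            yield item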

Blender
  • Thank you Blender - I will check this out later today. Seems to make sense – user2927435 Jan 12 '14 at 10:24
  • Ok sorry for this noob question, first time using the class CrawlSpider. Do I place the "href in hxs.select" in the "restrict_xpaths" within the rule? Do I also need an "allow"? – user2927435 Jan 12 '14 at 13:06
  • @user2927435: `CrawlSpider` has a `rules` class attribute that lets you create simple rules to follow links and pass crawled pages off to callbacks. You want to create a rule that follows all of the links in a given page of results and passes the crawled page off to a callback like `parse_result`. – Blender Jan 12 '14 at 21:57
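
To answer the restrict_xpaths question concretely: a rule along the lines below should be enough on its own, since SgmlLinkExtractor follows every link it extracts when no allow pattern is given. The XPath is a placeholder for wherever the role links actually live:

rules = (
    Rule(SgmlLinkExtractor(restrict_xpaths='//table'),
         callback='parse_result'),
)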