
I have seen the tutorials on how to use Scrapy, and I can now visit the links on a given page. But what I want to do is this: given a page, collect its data (metadata and summary), and also visit the links on that page and collect their data. This is my code so far (it does not collect any data yet):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
#from scrapy.item import SpideyItem

class spidey(CrawlSpider):
    name = "spidey"
    allowed_domains = ["wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Game_of_Thrones"]

    rules = (
        # restrict_xpaths should select the region containing the links,
        # not the @href attribute itself; "mw-body" is Wikipedia's content div
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="mw-body"]//a',))),
        Rule(SgmlLinkExtractor(allow=("http://en.wikipedia.org/wiki/",)), callback='parse_item'),
    )

    def parse_item(self, response):
        sel = HtmlXPathSelector(response)
        # print the page title taken from the first heading
        print sel.xpath('//h1[@class="firstHeading"]/span/text()').extract()

So after this I want to collect the data of the initial page as well as the data on the pages linked from it. I am new to web spiders, so any pointers are welcome.


1 Answer


I'm not sure what your question is exactly, but if you are asking how to collect data from multiple pages and save it into one item, this is your answer:

https://github.com/darkrho/scrapy-inline-requests
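For illustration, here is a minimal sketch of the inline approach, assuming the @inline_requests decorator that the package provides. The spider name, the PageItem class, and the XPaths are placeholders I made up, not part of your code:

    import urlparse

    from inline_requests import inline_requests
    from scrapy.http import Request
    from scrapy.item import Item, Field
    from scrapy.spider import Spider


    class PageItem(Item):
        title = Field()
        linked_title = Field()


    class WikiInlineSpider(Spider):
        name = "wiki_inline"
        start_urls = ["http://en.wikipedia.org/wiki/Game_of_Thrones"]

        @inline_requests
        def parse(self, response):
            # collect the start page's data first
            item = PageItem()
            item['title'] = response.xpath('//h1/span/text()').extract()

            # fetch a linked page from inside the same callback; the
            # decorator suspends here and resumes with the response
            # (assumes the page has at least one such link)
            link = response.xpath('//div[@id="mw-content-text"]//a/@href').extract()[0]
            linked = yield Request(urlparse.urljoin(response.url, link))
            item['linked_title'] = linked.xpath('//h1/span/text()').extract()

            yield item

The point is that yielding a Request suspends the callback until the response arrives, so a single callback can assemble one item from several pages.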

Also, if you don't want to do it the inline way, you can always store your item in request.meta and send it along in a request whose callback extracts the data from the next page (see the sketch below).

Check this answer: How can i use multiple requests and pass items in between them in scrapy python
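For example, a rough sketch of the request.meta approach; again, the spider, the PageItem class, and the XPaths are placeholder assumptions, only the meta-passing pattern is the point:

    import urlparse

    from scrapy.http import Request
    from scrapy.item import Item, Field
    from scrapy.spider import Spider


    class PageItem(Item):
        title = Field()
        linked_title = Field()


    class WikiMetaSpider(Spider):
        name = "wiki_meta"
        start_urls = ["http://en.wikipedia.org/wiki/Game_of_Thrones"]

        def parse(self, response):
            # collect the start page's data into a partially filled item
            item = PageItem()
            item['title'] = response.xpath('//h1/span/text()').extract()

            # follow one of the page's links, carrying the item in
            # request.meta (assumes the page has at least one such link)
            link = response.xpath('//div[@id="mw-content-text"]//a/@href').extract()[0]
            request = Request(urlparse.urljoin(response.url, link),
                              callback=self.parse_linked)
            request.meta['item'] = item
            yield request

        def parse_linked(self, response):
            # pull the partially filled item back out of meta and complete it
            item = response.meta['item']
            item['linked_title'] = response.xpath('//h1/span/text()').extract()
            yield item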
