i have a made a scraper that scrapes data from a website that have the data nested, i mean that to get to the data page i have to click 5 links then i get to the data page where i scrape the data
For every 1st page there are multiple page 2's for every page 2's there are many page 3's and so on
so here i have a parse function for opening each page until i get to the page that has the data and add the data to the item class ad return the item.
But it is skipping a lot of links without scraping data. It is not executing the last parse_link function after 100 or so links*. Well how do i know the parse_link function is not executing ?
it is because i am printing print '\n\n', 'I AM EXECUTED !!!!' and it is not printing after 100 or so links but the code executes parse_then every time
what i want to know is am i doing it right ? is this the right aproch to scrape a website like this
here is the code
# -*- coding: utf-8 -*-
import scrapy
from urlparse import urljoin
from nothing.items import NothingItem
class Canana411Spider(scrapy.Spider):
name = "canana411"
allowed_domains = ["www.canada411.ca"]
start_urls = ['http://www.canada411.ca/']
PAGE 1
def parse(self, response):
SET_SELECTOR = '.c411AlphaLinks.c411NoPrint ul li'
for attr in response.css(SET_SELECTOR):
linkse = 'a ::attr(href)'
link = attr.css(linkse).extract_first()
link = urljoin(response.url, link)
yield scrapy.Request(link, callback=self.parse_next)
PAGE 2
def parse_next(self, response):
SET_SELECTOR = '.clearfix.c411Column.c411Column3 ul li'
for attr in response.css(SET_SELECTOR):
linkse = 'a ::attr(href)'
link = attr.css(linkse).extract_first()
link = urljoin(response.url, link)
yield scrapy.Request(link, callback=self.parse_more)
PAGE 3
def parse_more(self, response):
SET_SELECTOR = '.clearfix.c411Column.c411Column3 ul li'
for attr in response.css(SET_SELECTOR):
linkse = 'a ::attr(href)'
link = attr.css(linkse).extract_first()
link = urljoin(response.url, link)
yield scrapy.Request(link, callback=self.parse_other)
PAGE 4
def parse_other(self, response):
SET_SELECTOR = '.clearfix.c411Column.c411Column3 ul li'
for attr in response.css(SET_SELECTOR):
linkse = 'a ::attr(href)'
link = attr.css(linkse).extract_first()
link = urljoin(response.url, link)
yield scrapy.Request(link, callback=self.parse_then)
PAGE 5
def parse_then(self, response):
SET_SELECTOR = '.c411Cities li h3 a ::attr(href)'
link = response.css(SET_SELECTOR).extract_first()
link = urljoin(response.url, link)
return scrapy.Request(link, callback=self.parse_link)
PAGE 6 THE DATA PAGE
def parse_link(self, response):
print '\n\n', 'I AM EXECUTED !!!!'
item = NothingItem()
namese = '.vcard__name ::text'
addressse = '.c411Address.vcard__address ::text'
phse = 'span.vcard__label ::text'
item['name'] = response.css(namese).extract_first()
item['address'] = response.css(addressse).extract_first()
item['phone'] = response.css(phse).extract_first()
return item
am i doing it right, or is there is a better way that i am missing ?