I have made a scraper for a website that has its data nested, meaning that to get to the data page I have to click through 5 links; only then do I reach the page where I scrape the data.

For every page 1 there are multiple page 2s, for every page 2 there are many page 3s, and so on.

So I have a parse function for opening each page until I get to the page that has the data, where I add the data to the item class and return the item.

But it is skipping a lot of links without scraping data: the last parse_link function stops executing after 100 or so links. How do I know parse_link is not executing?

Because I am printing print '\n\n', 'I AM EXECUTED !!!!' inside it, and after 100 or so links it no longer prints, even though parse_then still executes every time.
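(A more reliable check than a bare print, as a minimal sketch using Scrapy's built-in per-spider logger; its output lands in Scrapy's own log, right next to the "Filtered duplicate request" messages that the scheduler's dupefilter emits when it silently drops URLs it has already seen, which is a common reason a deep callback stops firing:)

    def parse_link(self, response):
        # self.logger is the standard per-spider logger provided by Scrapy
        self.logger.info('parse_link reached: %s', response.url)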

What I want to know is: am I doing it right? Is this the right approach to scrape a website like this?

Here is the code:

# -*- coding: utf-8 -*-
import scrapy
from urlparse import urljoin  # Python 2; on Python 3 this is urllib.parse
from nothing.items import NothingItem

class Canana411Spider(scrapy.Spider):
    name = "canana411"
    allowed_domains = ["www.canada411.ca"]
    start_urls = ['http://www.canada411.ca/']

    # PAGE 1

    def parse(self, response):
        SET_SELECTOR = '.c411AlphaLinks.c411NoPrint ul li'
        for attr in response.css(SET_SELECTOR):
            linkse = 'a ::attr(href)'
            link = attr.css(linkse).extract_first()
            link = urljoin(response.url, link)

            yield scrapy.Request(link, callback=self.parse_next)

    # PAGE 2

    def parse_next(self, response):

        SET_SELECTOR = '.clearfix.c411Column.c411Column3 ul li'
        for attr in response.css(SET_SELECTOR):
            linkse = 'a ::attr(href)'
            link = attr.css(linkse).extract_first()
            link = urljoin(response.url, link)
            yield scrapy.Request(link, callback=self.parse_more)

    # PAGE 3

    def parse_more(self, response):

        SET_SELECTOR = '.clearfix.c411Column.c411Column3 ul li'
        for attr in response.css(SET_SELECTOR):
            linkse = 'a ::attr(href)'
            link = attr.css(linkse).extract_first()
            link = urljoin(response.url, link)
            yield scrapy.Request(link, callback=self.parse_other)

    # PAGE 4

    def parse_other(self, response):
        SET_SELECTOR = '.clearfix.c411Column.c411Column3 ul li'
        for attr in response.css(SET_SELECTOR):
            linkse = 'a ::attr(href)'
            link = attr.css(linkse).extract_first()
            link = urljoin(response.url, link)
            yield scrapy.Request(link, callback=self.parse_then)

    # PAGE 5

    def parse_then(self, response):
        SET_SELECTOR = '.c411Cities li h3 a ::attr(href)'
        link = response.css(SET_SELECTOR).extract_first()
        link = urljoin(response.url, link)
        return scrapy.Request(link, callback=self.parse_link)

    # PAGE 6: THE DATA PAGE

    def parse_link(self, response):
        print '\n\n', 'I AM EXECUTED !!!!'
        item = NothingItem()
        namese = '.vcard__name ::text'
        addressse = '.c411Address.vcard__address ::text'
        phse = 'span.vcard__label ::text'
        item['name'] = response.css(namese).extract_first()
        item['address'] = response.css(addressse).extract_first()
        item['phone'] = response.css(phse).extract_first()
        return item

Am I doing it right, or is there a better way that I am missing?

Shantanu Bedajna
  • I would read the scrapy docs if I were you. In them you will find rules you can define to follow links, along with many other tools that make this much easier. – Verbal_Kint May 08 '17 at 07:50
  • That's basically three questions in one; consider splitting it up. Concerning your question on yield, please check this SO question: http://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do-in-python?rq=1 – Done Data Solutions May 08 '17 at 11:54
  • Got that answer, thanks. In a nutshell: iterators store previous values, generators don't; they continue from where they left off. – Shantanu Bedajna May 08 '17 at 13:12
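
(To make that comment concrete, a minimal, hypothetical generator example, unrelated to the spider: each iteration resumes the function right after the previous yield.)

    def links():
        # execution pauses at each yield and resumes from the
        # same point on the next pass through the loop
        yield 'link-1'
        yield 'link-2'

    for link in links():
        print link  # prints link-1, then link-2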

1 Answer


If there's no conflict (e.g. the 1st page cannot contain selectors and links that belong to the 3rd, or links that should only be followed from the 2nd, or something alike), I'd recommend flattening the rules used to extract links. One parse method would then be enough.
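(As a sketch of that suggestion, assuming Scrapy's stock CrawlSpider and LinkExtractor: the CSS selectors are reused from the question, but the /res/ URL pattern for the final profile pages is hypothetical and would need to be checked against the real site. Note that a CrawlSpider must not override parse, hence the parse_item name.)

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from nothing.items import NothingItem

    class Canada411FlatSpider(CrawlSpider):
        name = 'canana411_flat'
        allowed_domains = ['www.canada411.ca']
        start_urls = ['http://www.canada411.ca/']

        rules = (
            # hypothetical URL pattern for the final profile pages;
            # the real pattern has to be taken from the site
            Rule(LinkExtractor(allow=r'/res/'), callback='parse_item'),
            # follow every intermediate listing link; LinkExtractor
            # deduplicates and resolves relative URLs automatically
            Rule(LinkExtractor(restrict_css=(
                '.c411AlphaLinks.c411NoPrint',
                '.clearfix.c411Column.c411Column3',
                '.c411Cities',
            )), follow=True),
        )

        def parse_item(self, response):
            item = NothingItem()
            item['name'] = response.css('.vcard__name ::text').extract_first()
            item['address'] = response.css('.c411Address.vcard__address ::text').extract_first()
            item['phone'] = response.css('span.vcard__label ::text').extract_first()
            return item

One parse_item method replaces the six chained callbacks, and deduplication happens once across the whole crawl instead of every branch re-requesting the same shared links.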

Eugene Lisitsky