I'm new to Scrapy and have been able to create a few spiders so far. I would like to write a spider that crawls Yellowpages looking for websites that return a 404 response. The spider is working OK; however, the pagination is not working. Any help will be much appreciated. Thanks in advance.

# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    #allowed_domains = ['www.yellowpages.com']
    start_urls = ['https://www.yellowpages.com/search?search_terms=handyman&geo_location_terms=Miami%2C+FL']

    def parse(self, response):
        for listing in response.css('div.search-results.organic div.srp-listing'):
            url = listing.css('a.track-visit-website::attr(href)').extract_first()
            yield scrapy.Request(url=url, callback=self.parse_details)

        # follow pagination links
        next_page_url = response.css('a.next.ajax-page::attr(href)').extract_first()
        next_page_url = response.urljoin(next_page_url)
        if next_page_url:
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def parse_details(self, response):
        yield {'Response': response}
oscarQ
Hi David, this is my first time posting here and I was having problems formatting the code. My question is very simple: I'm having issues with the pagination of this spider and I'm not sure what I'm missing. – oscarQ Jul 01 '17 at 21:56

1 Answer

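I ran your code and found an error. In the first loop you don't check the value of url, and it is sometimes None (presumably because not every listing has a website link). Passing None to scrapy.Request raises an exception that stops the parse callback before it reaches the pagination code, which is why it looked as though the pagination wasn't working.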
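To see why a single bad listing hides the pagination, here is a minimal sketch in plain Python, with no Scrapy involved. Both fake_request and this parse are hypothetical stand-ins: fake_request only mimics the way scrapy.Request rejects a url of None, and the URLs are made up.

```python
def fake_request(url):
    # Hypothetical stand-in for scrapy.Request: reject a missing URL.
    if url is None:
        raise TypeError("Request url must be str, got NoneType")
    return ("request", url)


def parse(listing_urls, next_page_url):
    # Same shape as the spider's parse(): listings first, pagination last.
    for url in listing_urls:
        yield fake_request(url)
    # Never reached if the loop above raises.
    yield fake_request(next_page_url)


collected = []
error = None
try:
    listings = ["http://a.example", None, "http://b.example"]
    for request in parse(listings, "http://listing.example/page2"):
        collected.append(request)
except TypeError as exc:
    error = exc

print(collected)  # only the request yielded before the None listing
print(error)
```

The generator dies at the second listing, so neither the third listing nor the next-page request is ever produced.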

Here is a working version:

# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    #allowed_domains = ['www.yellowpages.com']
    start_urls = ['https://www.yellowpages.com/search?search_terms=handyman&geo_location_terms=Miami%2C+FL']

    def parse(self, response):
        for listing in response.css('div.search-results.organic div.srp-listing'):
            url = listing.css('a.track-visit-website::attr(href)').extract_first()
            if url:
                yield scrapy.Request(url=url, callback=self.parse_details)
        # follow the pagination link only if the page actually has one
        next_page_url = response.css('a.next.ajax-page::attr(href)').extract_first()
        if next_page_url:
            yield scrapy.Request(url=response.urljoin(next_page_url), callback=self.parse)

    def parse_details(self, response):
        yield {'Response': response}
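A side note on the pagination guard: response.urljoin is, as far as I know, a thin wrapper around the standard library's urljoin, which returns the base URL unchanged when the second argument is empty or None. Here is a small sketch of that behaviour. The page=2 href is an assumption for illustration, not taken from the real site.

```python
from urllib.parse import urljoin

base = "https://www.yellowpages.com/search?search_terms=handyman"

# A relative href, as found in a "next" link, is resolved against the page URL.
joined = urljoin(base, "/search?search_terms=handyman&page=2")

# With no href at all (None), urljoin simply hands back the base URL,
# so the result is still a non-empty, truthy string.
joined_none = urljoin(base, None)

print(joined)
print(joined_none)
print(bool(joined_none))
```

So a truthiness check placed after the join cannot tell a missing next link from a real one; testing the extracted href before joining is the safer order. Scrapy's duplicate filter will normally drop the repeated request for the current page, so this is a tidiness issue on the last page rather than a crash.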
Adrien Blanquer