
I'm trying to crawl viagogo.com. I want to crawl each show from the page http://www.viagogo.com/Concert-Tickets/Rock-and-Pop. I'm able to get the shows on the first page, but when I try to move to the next page it just doesn't crawl! Here is my code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from viagogo.items import ViagogoItem
from scrapy.http import Request, FormRequest

class viagogoSpider(CrawlSpider):
    name="viagogo"
    allowed_domains=['viagogo.com']
    start_urls = ["http://www.viagogo.com/Concert-Tickets/Rock-and-Pop"]

    rules = (
        # Running on pages
        Rule(SgmlLinkExtractor(restrict_xpaths=('//*[@id="clientgridtable"]/div[2]/div[2]/div/ul/li[7]/a')), callback='Parse_Page', follow=True),

        # Running on artists in title
        Rule(SgmlLinkExtractor(restrict_xpaths=('//*[@id="clientgridtable"]/table/tbody')), callback='Parse_artists_Tickets', follow=True),

    )


    def Parse_Page(self, response):
        item = ViagogoItem()
        item["title"] = response.xpath('//title/text()').extract()
        item["link"] = response.url
        print 'Page!' + response.url
        yield Request(url=response.url, meta={'item': item}, callback=self.Parse_Page)


    def Parse_artists_Tickets(self, response):
        item = ViagogoItem()
        item["title"] = response.xpath('//title/text()').extract()
        item["link"] = response.url
        print response.url
        with open('viagogo_output', 'a') as f:
            f.write(str(item["title"]) + '\n')
        return item

I cannot understand what I'm doing wrong, but the output (inside the file) only contains the first page's shows..

Thanks!

1 Answer


This:

yield Request(url=response.url, ...)

is asking Scrapy to crawl the same page it has crawled before again, not really advancing to the next page. Scrapy has a dupefilter enabled by default that avoids making duplicate requests -- that's probably why the second request is not happening and the second callback is never being called.
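(For reference, Scrapy's ``Request`` accepts a ``dont_filter=True`` argument that bypasses the dupefilter -- but in this spider that would only re-download the same page again, so it is not a fix for pagination:)

yield Request(url=response.url, meta={'item': item},
              callback=self.Parse_Page, dont_filter=True)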

If you want to continue parsing more items from the same response, you can just call the second callback directly, passing it the response:

def Parse_Page(self, response):
    # extract and yield some items here...
    # then hand the same response object to the other callback
    for it in self.Parse_artists_Tickets(response):
        yield it

If you want to follow to a different page, you have to yield a Request to a URL not previously seen.
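
For example, here is a minimal sketch of following the pager (untested; the XPath assumes the ``js-pager``/``js-next`` markup discussed in the comments below, so adjust it to the actual page structure):

import urlparse
from scrapy.http import Request

def Parse_Page(self, response):
    # yield the items extracted from the current page first...
    # then schedule the next page, if the pager exposes a link
    next_href = response.xpath(
        "//ul[@class='js-pager']/li[@class='js-next']/a/@href").extract()
    if next_href:
        next_url = urlparse.urljoin(response.url, next_href[0])
        yield Request(url=next_url, callback=self.Parse_Page)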

Elias Dorneles
  • I didn't get it.. when I get the first response, the response.url is on the next page, so the response is not the same – SomeNiceGuy21 Dec 13 '14 at 12:55
  • @SomeNiceGuy21 ``response.url`` is the URL of the request that was made and that originated the current response object passed as an argument. When you do ``yield Request(url=response.url, ...)`` you're scheduling a request for that same URL to be made again -- and this one will be skipped. Did this help? – Elias Dorneles Dec 13 '14 at 15:41
  • Not exactly. I realized that the 'next' button is calling a JS function, but I don't know how to call it.. can you help? – SomeNiceGuy21 Dec 13 '14 at 15:45
  • @SomeNiceGuy21 Okay, I can coach you. Here is how you can get the path to the next page: ``response.xpath("//ul[@class='js-pager']/li[@class='js-next']/a/@href")`` -- you can use ``urlparse.urljoin`` with the site URL to get the absolute URL. Give it a try. – Elias Dorneles Dec 13 '14 at 17:12
  • Thanks! I think I'm not being clear enough. I need to click on the next page, which calls a JS function. The page address is always the same! In the first JS function I see JSON with all the data (this is a JSON with parameters). Two questions: how do I simulate this JSON request in Scrapy, and how do I scrape it with Scrapy? – SomeNiceGuy21 Dec 14 '14 at 16:21
  • @SomeNiceGuy21 For how to replicate AJAX requests, see this question: http://stackoverflow.com/questions/8550114/can-scrapy-be-used-to-scrape-dynamic-content-from-websites-that-are-using-ajax Btw, the URL for the next page here shows up as ``http://www.viagogo.com/Concert-Tickets/Rock-and-Pop/page-2`` <-- not the same address. Namely, the site works without AJAX too. ;) – Elias Dorneles Dec 14 '14 at 20:06
  • Thanks, but the link http://www.viagogo.com/Concert-Tickets/Rock-and-Pop/page-2 is returning the results of the first page! The URL may be different but the results are the same; I need to click on 'next' to see the new results.. – SomeNiceGuy21 Dec 14 '14 at 20:08
  • @SomeNiceGuy21 You're right, I hadn't noticed it! Hmmm, yup, you're going to have to replicate the AJAX request and parse the JSON. See the link for the question from my previous comment: basically, the idea is to use the information in your browser's Network tab to replicate the request fully (URL, parameters and headers). If you have ``curl`` installed, you can try using the [Copy as cURL](http://www.lornajane.net/posts/2013/chrome-feature-copy-as-curl) feature of your browser and try to replicate the request on the command line. – Elias Dorneles Dec 14 '14 at 20:29
  • Thanks for your patience :) I copied the cURL and tried it at http://onlinecurl.com/, and I did NOT get the JSON. Weird, no? – SomeNiceGuy21 Dec 15 '14 at 20:29
  • @SomeNiceGuy21 Not really, sorry. – Elias Dorneles Dec 17 '14 at 18:09
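
Following up on the AJAX discussion above, here is a rough sketch of replicating the pager request in Scrapy. The endpoint URL, payload and headers below are placeholders, not the site's real API -- copy the actual values from the request shown in your browser's Network tab:

import json
from scrapy.http import Request

def request_next_page(self, page_number):
    # Hypothetical endpoint and payload -- replace both with the URL,
    # parameters and headers captured from the Network tab ("Copy as
    # cURL" makes this easier).
    return Request(
        url='http://www.viagogo.com/some-ajax-endpoint',  # placeholder URL
        method='POST',
        body=json.dumps({'page': page_number}),           # placeholder payload
        headers={
            'Content-Type': 'application/json',
            'X-Requested-With': 'XMLHttpRequest',
        },
        callback=self.parse_json_page,
    )

def parse_json_page(self, response):
    # The AJAX response is JSON, not HTML, so parse it instead of using XPath
    data = json.loads(response.body)
    # extract the show fields from the parsed dict here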