
First, highest appreciation for all of your work answering noob questions like this one.

Second, since this seems to be a fairly common problem, I found (IMO) related questions such as: Scrapy: Wait for a specific url to be parsed before parsing others

However, at my current level of understanding it is not straightforward to adapt those suggestions to my specific case, and I would really appreciate your help.

Problem outline (running on Python 3.7.1, Scrapy 1.5.1):

I want to scrape data from every link collected on pages like this one: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1

and then from all the links on another such page:

https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650

I manage to get the desired information (only two elements are shown here) if I run the spider for one page (e.g. page 1 or 650) at a time. (Note that I restricted the number of links crawled per page to 2.) However, as soon as I have multiple start URLs (two elements in the list [1,650] in the code below), the parsed data is no longer consistent. Apparently at least one element is not found by the xpath. I suspect some (or a lot of) incorrect logic in how I handle/pass the requests, which leads to an order of parsing other than the one I intended.

Code:

import scrapy
from scrapy.spiders import CrawlSpider


class SlfSpider1Spider(CrawlSpider):
    name = 'slf_spider1'
    custom_settings = { 'CONCURRENT_REQUESTS': '1' }    
    allowed_domains = ['gipfelbuch.ch']
    start_urls = ['https://www.gipfelbuch.ch/gipfelbuch/touren/seite/'+str(i) for i in [1,650]]

    # Method which starts the requests by visiting all URLs specified in start_urls
    def start_requests(self):
        for url in self.start_urls:
            print('#### START REQUESTS: ',url)
            yield scrapy.Request(url, callback=self.parse_verhaeltnisse, dont_filter=True)

    def parse_verhaeltnisse(self,response):
        links = response.xpath('//td//@href').extract()
        for link in links[0:2]:
            print('##### PARSING: ',link)
            abs_link = 'https://www.gipfelbuch.ch/'+link
            yield scrapy.Request(abs_link, callback=self.parse_gipfelbuch_item, dont_filter=True)


    def parse_gipfelbuch_item(self, response):
        route = response.xpath('/html/body/main/div[4]/div[@class="col_f"]//div[@class="togglebox cont_item mt"]//div[@class="label_container"]')

        print('#### PARSER OUTPUT: ')

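        # build a dict mapping each label (e.g. 'Gipfelname') to the text of its label_content div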
        key=[route[i].xpath('string(./label)').extract()[0] for i in range(len(route))]
        value=[route[i].xpath('string(div[@class="label_content"])').extract()[0] for i in range(len(route))]
        fields = dict(zip(key,value))

        print('Route: ', fields['Gipfelname'])
        print('Comments: ', fields['Verhältnis-Beschreibung'])

        print('Length of dict extracted from Route: {}'.format(len(route)))
        return

Command prompt output:

2019-03-18 15:42:27 [scrapy.core.engine] INFO: Spider opened
2019-03-18 15:42:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-18 15:42:27 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
#### START REQUESTS:  https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1
2019-03-18 15:42:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1> (referer: None)
#### START REQUESTS:  https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650
##### PARSING:  /gipfelbuch/detail/id/101559/Skitour_Snowboardtour/Beaufort
##### PARSING:  /gipfelbuch/detail/id/101557/Skitour_Snowboardtour/Blinnenhorn
2019-03-18 15:42:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650> (referer: None)
##### PARSING:  /gipfelbuch/detail/id/69022/Alpine_Wanderung/Schwaendeliflue
##### PARSING:  /gipfelbuch/detail/id/69021/Schneeschuhtour/Cima_Portule

2019-03-18 15:42:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/101557/Skitour_Snowboardtour/Blinnenhorn> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1)
#### PARSER OUTPUT:
Route:  Blinnenhorn/Corno Cieco
Comments:  Am Samstag Aufstieg zur Corno Gries Hütte, ca. 2,5h ab All Acqua. Zustieg problemslos auf guter Spur. Zur Verwunderung waren wir die einzigsten auf der Hütte. Danke an Monika für die herzliche Bewirtung...
Length of dict extracted from Route: 27

2019-03-18 15:42:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/69021/Schneeschuhtour/Cima_Portule> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650)
#### PARSER OUTPUT:
Route:  Cima Portule
Comments:  Sehr viel Schnee in dieser Gegend und viel Spirarbeit geleiset, deshalb auch viel Zeit gebraucht.
Length of dict extracted from Route: 19

2019-03-18 15:42:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/69022/Alpine_Wanderung/Schwaendeliflue> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650)
#### PARSER OUTPUT:
Route:  Schwändeliflue
Comments:  Wege und Pfade meist schneefrei, da im Gebiet viel Hochmoor ist, z.t. sumpfig.  Oberhalb 1600m und in Schattenlagen bis 1400m etwas Schnee  (max.Schuhtief).  Wetter sonnig und sehr warm für die Jahreszeit, T-Shirt - Wetter,  Frühlingshaft....
Length of dict extracted from Route: 17

2019-03-18 15:42:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/101559/Skitour_Snowboardtour/Beaufort> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1)
#### PARSER OUTPUT:
Route:  Beaufort
2019-03-18 15:42:40 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/101559/Skitour_Snowboardtour/Beaufort> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1)
Traceback (most recent call last):
  File "C:\Users\Lenovo\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\Lenovo\Dropbox\Code\avalanche\scrapy\slf1\slf1\spiders\slf_spider1.py", line 38, in parse_gipfelbuch_item
    print('Comments: ', fields['Verhältnis-Beschreibung'])
KeyError: 'Verhältnis-Beschreibung'
2019-03-18 15:42:40 [scrapy.core.engine] INFO: Closing spider (finished)

Question: How do I have to structure the first (for the links) and second (for the content) parsing steps correctly? And why is the "PARSER OUTPUT" not in the order I would expect (first page 1, links from top to bottom, then the second page, links from top to bottom)?

I have already tried reducing CONCURRENT_REQUESTS to 1 and setting DOWNLOAD_DELAY to 2.
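For reference, a minimal sketch of how these two settings can sit together in the spider (assuming they go into custom_settings; settings.py works just as well):

    custom_settings = {
        'CONCURRENT_REQUESTS': 1,   # only one request in flight at a time
        'DOWNLOAD_DELAY': 2,        # wait 2 seconds between requests
    }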

I hope the question is clear enough... big thanks in advance.

1 Answer


If the problem is that several URLs are visited at the same time, you can visit them one by one using the spider_idle signal (https://docs.scrapy.org/en/latest/topics/signals.html).

The idea is the following:

1. start_requests only visits the first URL.

2. When the spider gets idle, the method spider_idle is called.

3. The method spider_idle deletes the first URL and visits the second URL.

4. And so on...

The code would be something like this (I didn't try it):

import scrapy
from scrapy import Request, signals
from scrapy.spiders import CrawlSpider


class SlfSpider1Spider(CrawlSpider):
    name = 'slf_spider1'
    custom_settings = { 'CONCURRENT_REQUESTS': '1' }   
    allowed_domains = ['gipfelbuch.ch']
    start_urls = ['https://www.gipfelbuch.ch/gipfelbuch/touren/seite/'+str(i) for i in [1,650]]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(SlfSpider1Spider, cls).from_crawler(crawler, *args, **kwargs)
        # Here you set which method the spider has to run when it gets idle
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    # Method which starts the requests; here it only visits the first URL in start_urls
    def start_requests(self):
        # the spider visits only the first provided URL
        url = self.start_urls[0]
        print('#### START REQUESTS: ',url)
        yield scrapy.Request(url, callback=self.parse_verhaeltnisse, dont_filter=True)

    def parse_verhaeltnisse(self,response):
        links = response.xpath('//td//@href').extract()
        for link in links[0:2]:
            print('##### PARSING: ',link)
            abs_link = 'https://www.gipfelbuch.ch/'+link
            yield scrapy.Request(abs_link, callback=self.parse_gipfelbuch_item, dont_filter=True)


    def parse_gipfelbuch_item(self, response):
        route = response.xpath('/html/body/main/div[4]/div[@class="col_f"]//div[@class="togglebox cont_item mt"]//div[@class="label_container"]')

        print('#### PARSER OUTPUT: ')

        key=[route[i].xpath('string(./label)').extract()[0] for i in range(len(route))]
        value=[route[i].xpath('string(div[@class="label_content"])').extract()[0] for i in range(len(route))]
        fields = dict(zip(key,value))

        print('Route: ', fields['Gipfelname'])
        print('Comments: ', fields['Verhältnis-Beschreibung'])

        print('Length of dict extracted from Route: {}'.format(len(route)))
        return

    # When the spider gets idle, it deletes the first url and visits the second, and so on...
    def spider_idle(self, spider):
        del(self.start_urls[0])
        if len(self.start_urls)>0:
            url = self.start_urls[0]
            self.crawler.engine.crawl(Request(url, callback=self.parse_verhaeltnisse, dont_filter=True), spider)
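Note that, as far as I understand the idle signal, the spider only becomes idle once every request spawned from the current overview page (including its detail pages) has been processed, so all links of the first page should be handled before the second page is even requested.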
  • Thanks! But... now the start URLs appear to be processed "in order". However, it has meanwhile turned out that I have problems even if I only scrape one page (only one start URL). More specifically, if I increase the number of links to e.g. 5, I only get the expected output for two out of the five (apparently consistently); for the others I get the same error message. Also, the order of the five, as printed, changes from run to run. Typically two of the first three links succeed. Note that every one of the links is successfully parsed in some runs... so I think differences in the page HTML can be excluded. Any more ideas? – bebissig Mar 19 '19 at 16:43
  • 1) The order is not relevant. 2) Sometimes I encountered similar errors caused by Scrapy parsing various URLs at the same time. My idea is to also change the parse_verhaeltnisse method and make it process URLs one by one, like the new start_requests method; you should move "the brain" of the spider into spider_idle: every time the spider gets idle, this method decides which new URL has to be visited (a rough sketch of this idea is shown below). – Stefano Fiorucci - anakin87 Mar 20 '19 at 09:37
  • Good suggestion. I implemented a similar approach to what you suggested for the sublinks as well. It does work in the sense that I can fully control the order of reading, but the problem with the content remains. I think I found the problem elsewhere: the page requires a login to extract all data from the sublinks. Only two are shown without login; for the rest I only get the header, which explains the KeyError. I'll work on a solution and keep this posted for reference. – bebissig Mar 21 '19 at 15:00
  • For reference: in the end it turned out that I had to disable cookies; then I could read multiple pages (see the sketch below). Thank you anyway. --> COOKIES_ENABLED = False – bebissig Mar 23 '19 at 13:48
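A rough, untested sketch of the "one sublink at a time" idea from the comments (the pending_links attribute is my own name, not from the original code; these methods would replace the corresponding ones inside the spider class above):

    def parse_verhaeltnisse(self, response):
        links = response.xpath('//td//@href').extract()
        # queue the first two detail links instead of requesting them all at once
        self.pending_links = ['https://www.gipfelbuch.ch/' + link for link in links[0:2]]

    def spider_idle(self, spider):
        # first work through the queued detail links, one per idle event
        if getattr(self, 'pending_links', None):
            url = self.pending_links.pop(0)
            self.crawler.engine.crawl(Request(url, callback=self.parse_gipfelbuch_item, dont_filter=True), spider)
            return
        # then move on to the next overview page, as in the answer above
        del self.start_urls[0]
        if len(self.start_urls) > 0:
            url = self.start_urls[0]
            self.crawler.engine.crawl(Request(url, callback=self.parse_verhaeltnisse, dont_filter=True), spider)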
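And the final fix from the last comment, as a minimal sketch (assuming it goes into the spider's custom_settings; settings.py works as well):

    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        'COOKIES_ENABLED': False,   # without this, the site served full data for only two detail pages per run
    }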