
I am fairly new to Python, Scrapy and this board, so please bear with me, as I try to illustrate my problem.

My goal is to collect the names (and possibly prices) of all available hotels in Berlin on booking.com for a specific date (see for example the predefined start_url) with the help of Scrapy.

I think the crucial parts are:

  1. I want to paginate through all next pages until the end.
  2. On each page I want to collect the name of every hotel, and each name should be saved as a separate item.

If I run "scrapy runspider bookingspider.py -o items.csv -t csv" for my code below, the terminal shows me that it crawls through all available pages, but in the end I only get an empty items.csv.

Step 1 seems to work, as the terminal shows that successive URLs are being crawled (e.g. [...]offset=15, then [...]offset=30). Therefore I think my problem is step 2. For step 2 one needs to define a container or block in which each hotel's information is contained separately and which can serve as the basis for a loop, right? I picked `div class="sr_item_content sr_item_content_slider_wrapper"`, since every hotel block has this element at a superordinate level, but I am really unsure about this part. Maybe one has to consider a higher level (but which element should I take, since they are not the same across the hotel blocks?). Anyway, based on that I figured out the remaining XPath to the element that contains the hotel name.
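In other words, the pattern I have in mind for `parse` is roughly the following (just a sketch of the idea; the container class is my current guess and the inner path is simplified):

hotels = response.xpath('//div[@class="sr_item_content sr_item_content_slider_wrapper"]')
for hotel in hotels:
    # the leading "." keeps the expression relative to this hotel block
    name = hotel.xpath('.//h3/a/span/text()').extract_first()
    yield {'title': name}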

I followed two tutorials with similar settings (though different websites), but somehow it does not work here.

Maybe you have an idea; any help is very much appreciated. Thank you!

from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.item import Item, Field
from scrapy.http.request import Request

class HotelItem(Item):
    title = Field()
    price = Field()

class BookingCrawler(CrawlSpider):
    name = 'booking_crawler'
    allowed_domains = ['booking.com']
    start_urls = ['http://www.booking.com/searchresults.html?checkin_monthday=25;checkin_year_month=2016-10;checkout_monthday=26;checkout_year_month=2016-10;class_interval=1;dest_id=-1746443;dest_type=city;offset=0;sb_travel_purpose=leisure;si=ai%2Cco%2Cci%2Cre%2Cdi;src=index;ss=Berlin']
    custom_settings = {
        'BOT_NAME': 'booking-scraper',
        }

    def parse(self, response):
        s = Selector(response)
        index_pages = s.xpath('//div[@class="results-paging"]/a/@href').extract()
        if index_pages:
            for page in index_pages:
                yield Request(response.urljoin(page), self.parse)

        hotels = s.xpath('//div[@class="sr_item_content sr_item_content_slider_wrapper"]') 
        items = []
        for hotel in hotels:
            item = HotelItem()
            item['title'] = hotel.xpath('div[1]/div[1]/h3/a/span/text()').extract()[0]
            item['price'] = hotel.xpath('//div[@class="sr-prc--num sr-prc--final"]/text()').extract()[0]
            items.append(item)

        for item in items:
            yield item
Maik Drop
  • Try putting a `self.logger.debug(hotels)` in there. That is probably empty, which would mean you need to rework the corresponding XPath expression. Remember to check the contents of the web page (and not the DOM that you see by inspecting in the browser, in Firefox you can use Ctrl+U to inspect the actual HTML content) – Gallaecio Feb 11 '19 at 14:31

1 Answer


I think the problem may be with your XPath on this line:

hotels = s.xpath('//div[@class="sr_item_content sr_item_content_slider_wrapper"]')

From this SO question it looks like you need to define something more along the lines of:

//div[contains(@class, 'sr_item_content') and contains(@class, 'sr_item_content_slider_wrapper')]
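Plugged into your spider, that might look something like this (an untested sketch; the relative paths for the name and price are guesses based on your code and may need adjusting against the live markup):

hotels = s.xpath('//div[contains(@class, "sr_item_content") and contains(@class, "sr_item_content_slider_wrapper")]')
for hotel in hotels:
    item = HotelItem()
    # ".//" keeps both expressions relative to the current hotel block, and
    # extract_first() returns None instead of raising when nothing matches
    item['title'] = hotel.xpath('.//h3/a/span/text()').extract_first()
    item['price'] = hotel.xpath('.//div[contains(@class, "sr-prc--final")]/text()').extract_first()
    yield item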

To help you debug further, you could try outputting the contents of index_pages first to see if it is definitely returning what you expect on that level.
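For example, with Scrapy's interactive shell you can test the expressions against the real page before touching the spider (the URL below is the start URL from your question, shortened here):

scrapy shell "http://www.booking.com/searchresults.html?...;ss=Berlin"
>>> response.xpath('//div[@class="results-paging"]/a/@href').extract()
>>> response.xpath('//div[contains(@class, "sr_item_content")]')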

Also, check XPath Visualiser (also mentioned in that question), which can help with building XPath expressions.

Bassie
    Thanks a lot. I use "hotels = s.xpath('//div[contains(@class, "sr_item_new")]')" now, since it introduces each hotel block. It helps, as the terminal shows more for each crawled page, but it now returns an "IndexError: list index out of range" for each crawled page. Do you have an idea? Btw, I tried XPath Visualizer, it seems that it cannot handle the booking xml, since it crashes again and again. – Maik Drop Jun 16 '16 at 21:06
  • @MaikDrop No worries! I would guess that maybe the arrays on the lines `item['title'] = hotel.xpath('div[1]/div[1]/h3/a/span/text()').extract()[0] item['price'] = hotel.xpath('//div[@class="sr-prc--num sr-prc--final"]/text()').extract()[0]` are empty. Did you also update the code for these lines? If this post solved your problem you should mark it as an answer, then maybe you can create a new question for the next issue – Bassie Jun 16 '16 at 21:22
  • Yes, I did, for example `item['title'] = hotel.xpath('div[2]/div[1]/div[1]/h3/a/span/text()').extract()[0]`. Still, it does not work. Basically I have the same problem now: it scrapes, but collects 0 items, now with an error on each crawled page. What is my mistake? Based on the hotels path, I use the remaining XPath to the element. Thank you. – Maik Drop Jun 17 '16 at 11:21
  • @MaikDrop Can you check the contents of `hotels` before running the `item['title']` command? It would be worth knowing whether this has the right data before trying to get it out. Without this information it is hard to help you - I suggest editing your question with the output for `hotels`, or if the `hotels` xpath also returns an error, you should post that error! – Bassie Jun 17 '16 at 12:27
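A note on the `IndexError: list index out of range` from the comments above: `.extract()[0]` raises exactly that error whenever the XPath matches nothing. A more defensive version of the loop (a sketch only, using the `sr_item_new` container mentioned in the comments and guessed inner paths) would log what was matched and skip empty results instead of crashing:

hotels = s.xpath('//div[contains(@class, "sr_item_new")]')
self.logger.debug('matched %d hotel blocks on %s', len(hotels), response.url)
for hotel in hotels:
    title = hotel.xpath('.//h3/a/span/text()').extract_first()
    if title is None:
        # nothing matched inside this block - skip it instead of crashing
        continue
    item = HotelItem()
    item['title'] = title
    item['price'] = hotel.xpath('.//div[contains(@class, "sr-prc--final")]/text()').extract_first()
    yield item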