I am fairly new to Python, Scrapy and this board, so please bear with me, as I try to illustrate my problem.
My goal is to collect the names (and possibly prices) of all available hotels in Berlin on booking.com for a specific date (see for example the predefined start_url) with the help of Scrapy.
I think the crucial parts are:
- I want to paginate through all next pages until the end.
- On each page I want to collect the name of every hotel and save each name as its own item.
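To make the first point concrete, here is a minimal sketch of how the next page URL could be derived, assuming the result pages are addressed by an offset parameter that grows in steps of 15 (as in my start URL, which begins at offset=0). The names next_page_url and PAGE_SIZE are made up for this illustration and are not part of Scrapy; my actual spider follows the paging links instead.

```python
import re

# Assumption: each result page holds 15 hotels, so offsets go 0, 15, 30, ...
PAGE_SIZE = 15


def next_page_url(url, page_size=PAGE_SIZE):
    """Return url with its offset parameter advanced by one page."""
    match = re.search(r'\boffset=(\d+)', url)
    if match is None:
        raise ValueError('URL has no offset parameter: %s' % url)
    new_offset = int(match.group(1)) + page_size
    # Replace only the digits of the offset, leave the rest of the URL as-is.
    return url[:match.start(1)] + str(new_offset) + url[match.end(1):]


if __name__ == '__main__':
    first = 'http://www.booking.com/searchresults.html?dest_type=city;offset=0;ss=Berlin'
    print(next_page_url(first))
```

This sketch does not decide when to stop; a real spider would still need some end condition, for example a page without any hotel blocks.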
When I run "scrapy runspider bookingspider.py -o items.csv -t csv" on my code below, the terminal shows that it crawls through all available pages, but in the end items.csv is empty.
Step 1 (pagination) seems to work, since the terminal shows the subsequent URLs being crawled (e.g. [...]offset=15, then [...]offset=30). I therefore think my problem lies in step 2 (extracting the names). For step 2 one needs to define a container or block that holds the information for each hotel separately and can serve as the basis for a loop, right? I picked <div class="sr_item_content sr_item_content_slider_wrapper">, since every hotel block has this element at a superordinate level, but I am really unsure about this part. Maybe one has to pick a higher-level element (but which one, since the higher-level elements are not the same across the hotel blocks?). Anyway, starting from that container I worked out the remaining XPath to the element that contains the hotel name.
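To show what I mean by a container in step 2, here is a minimal sketch on a made-up HTML fragment. It uses xml.etree.ElementTree from the standard library instead of Scrapy's Selector, purely so that it runs offline; the idea of looping over blocks and using paths relative to each block is the same. The class names hotel_block and price, as well as SAMPLE and extract_hotels, are hypothetical and do not reflect booking.com's real markup.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment; class names and nesting are illustrative only.
SAMPLE = """
<div id="results">
  <div class="hotel_block">
    <h3><a><span>Hotel A</span></a></h3>
    <div class="price">80</div>
  </div>
  <div class="hotel_block">
    <h3><a><span>Hotel B</span></a></h3>
  </div>
</div>
"""


def extract_hotels(markup):
    """Return one dict per hotel block found in markup."""
    root = ET.fromstring(markup)
    hotels = []
    # First select every container, then query inside each one.
    for block in root.findall(".//div[@class='hotel_block']"):
        # Paths starting with "." are evaluated relative to this block only.
        title = block.findtext("./h3/a/span")
        price = block.findtext("./div[@class='price']")  # None if missing
        hotels.append({"title": title, "price": price})
    return hotels


if __name__ == '__main__':
    for hotel in extract_hotels(SAMPLE):
        print(hotel)
```

Reading a missing field as None rather than indexing with [0] keeps one incomplete hotel block from aborting the whole loop.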
I followed two tutorials with similar settings (though different websites), but somehow it does not work here.
Maybe you have an idea; any help is very much appreciated. Thank you!
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.item import Item, Field
from scrapy.http.request import Request


class HotelItem(Item):
    title = Field()
    price = Field()


class BookingCrawler(CrawlSpider):
    name = 'booking_crawler'
    allowed_domains = ['booking.com']
    start_urls = ['http://www.booking.com/searchresults.html?checkin_monthday=25;checkin_year_month=2016-10;checkout_monthday=26;checkout_year_month=2016-10;class_interval=1;dest_id=-1746443;dest_type=city;offset=0;sb_travel_purpose=leisure;si=ai%2Cco%2Cci%2Cre%2Cdi;src=index;ss=Berlin']
    custom_settings = {
        'BOT_NAME': 'booking-scraper',
    }

    def parse(self, response):
        s = Selector(response)
        index_pages = s.xpath('//div[@class="results-paging"]/a/@href').extract()
        if index_pages:
            for page in index_pages:
                yield Request(response.urljoin(page), self.parse)
        hotels = s.xpath('//div[@class="sr_item_content sr_item_content_slider_wrapper"]')
        items = []
        for hotel in hotels:
            item = HotelItem()
            item['title'] = hotel.xpath('div[1]/div[1]/h3/a/span/text()').extract()[0]
            item['price'] = hotel.xpath('//div[@class="sr-prc--num sr-prc--final"]/text()').extract()[0]
            items.append(item)
        for item in items:
            yield item