I have a site to scrape. Its main page holds story teasers, so this page will be our starting point for parsing. My spider goes from there and collects data about every story: author, rating, publication date, etc. This part the spider does correctly.
    import scrapy
    from scrapy.spiders import Spider
    from scrapy.http.request import Request

    from sxtl.items import SxtlItem


    class SxtlSpider(Spider):
        name = "sxtl"
        start_urls = ['some_site']

        def parse(self, response):
            list_of_stories = response.xpath('//div[@id and @class="storyBox"]')
            item = SxtlItem()
            for i in list_of_stories:
                pre_rating = i.xpath('div[@class="storyDetail"]'
                                     '/div[@class="storyDetailWrapper"]'
                                     '/div[@class="block rating_positive"]'
                                     '/span/text()').extract()
                rating = float("".join(pre_rating).replace("+", ""))
                link = "".join(i.xpath('div[@class="wrapSLT"]'
                                       '/div[@class="titleStory"]'
                                       '/a/@href').extract())
                if rating > 6:
                    yield Request(link, meta={'item': item},
                                  callback=self.parse_story)
                else:
                    break

        def parse_story(self, response):
            item = response.meta['item']
            number_of_pages = response.xpath('//div[@class="pNavig"]'
                                             '/a[@href][last()-1]/text()').extract()
            if number_of_pages:
                item['number_of_pages'] = int("".join(number_of_pages))
            else:
                item['number_of_pages'] = 1
            item['date'] = "".join(response.xpath(
                '//span[@class="date"]/text()').extract()).strip()
            item['author'] = "".join(response.xpath(
                '//a[@class="author"]/text()').extract()).strip()
            item['text'] = response.xpath(
                '//div[@id="storyText"]/div[@itemprop="description"]/text()'
                ' | //div[@id="storyText"]/div[@itemprop="description"]/p/text()'
            ).extract()
            item['list_of_links'] = response.xpath(
                '//div[@class="pNavig"]/a[@href]/@href').extract()
            yield item
So, the data is gathered correctly, BUT I get ONLY THE FIRST page of every story. Every story has several pages (with links to the 2nd, 3rd, 4th pages, sometimes up to 15 pages). That's where the problem arises. To fetch the 2nd page of every story, I replaced "yield item" with this:
            yield Request(item['list_of_links'][0], meta={'item': item},
                          callback=self.get_text)

        def get_text(self, response):
            item = response.meta['item']
            item['text'].extend(response.xpath(
                '//div[@id="storyText"]/div[@itemprop="description"]/text()'
                ' | //div[@id="storyText"]/div[@itemprop="description"]/p/text()'
            ).extract())
            yield item
The spider does collect the next (2nd) pages, BUT it attaches them to the first page of the WRONG story. For example, the 2nd page of the 1st story may be appended to the 4th story, the 2nd page of the 5th story to the 1st story, and so on.
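I wonder if the cause is that a single item object is shared by reference between all the scheduled requests. Here is a minimal sketch (plain Python, no Scrapy; the names are made up) of what I suspect happens when one object created before the loop is mutated on every iteration:

```python
# Minimal demonstration (no Scrapy): one dict created before the loop is
# shared by reference among every "request", so later mutations overwrite
# earlier ones -- just like a single SxtlItem passed in meta.
item = {}                            # created once, like item = SxtlItem()
pending = []                         # stands in for the scheduled Requests

for story_id in (1, 2, 3):
    item["story"] = story_id         # mutates the SAME object each time
    pending.append({"item": item})   # like meta={'item': item}

# By the time the callbacks run, every meta points at the same dict:
results = [meta["item"]["story"] for meta in pending]
print(results)  # -> [3, 3, 3], not [1, 2, 3]
```

If that is what is going on, it would explain why pages end up glued to arbitrary stories.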
Please help: how can I collect data into one item (one dictionary) when the data to be scraped is spread across several web pages? (In this case: how do I keep data from different items from getting mixed together?)
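For what it's worth, here is the direction I'm considering, as an untested plain-Python model rather than real Scrapy code (parse_story and get_text stand in for the callbacks, and lists of strings stand in for the paginated responses). The idea is to create a fresh item per story and pass that same item along the chain of page "requests", yielding it only after the last page:

```python
# Plain-Python model of chained callbacks: each story gets its OWN item,
# and each page "request" carries that item until the last page yields it.
def parse_story(story_id, pages):
    item = {"story": story_id, "text": [pages[0]]}  # fresh item per story
    yield from get_text(item, pages[1:])            # chain remaining pages

def get_text(item, remaining):
    if remaining:                          # more pages: keep chaining
        item["text"].append(remaining[0])  # like item['text'].extend(...)
        yield from get_text(item, remaining[1:])
    else:
        yield item                         # last page: emit finished item

stories = {1: ["1a", "1b", "1c"], 2: ["2a", "2b"]}
results = [it for sid, pages in stories.items() for it in parse_story(sid, pages)]
print(results)
# -> [{'story': 1, 'text': ['1a', '1b', '1c']}, {'story': 2, 'text': ['2a', '2b']}]
```

I'm not sure whether this is the idiomatic way to do it in Scrapy, though.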
Thanks.