I would love to know what you think about this, please. I have been researching this for a few days now and I can't seem to find where I am going wrong. Any help will be highly appreciated.
I want to systematically crawl this question site (http://www.studyacer.com/latest), using its pagination to reach the rest of the pages.
My current code:
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.selector import Selector
    from scrapy.spiders import CrawlSpider, Rule
    from acer.items import AcerItem


    class AcercrawlerSpider(CrawlSpider):
        name = 'acercrawler'
        allowed_domains = ['studyacer.com']
        start_urls = ['http://www.studyacer.com/latest']

        rules = (
            Rule(LinkExtractor(), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            questions = Selector(response).xpath('//td[@class="word-break"]/a/@href').extract()
            for question in questions:
                item = AcerItem()
                item['title'] = question.xpath('//h1/text()').extract()
                item['body'] = Selector(response).xpath('//div[@class="row-fluid"][2]//p/text()').extract()
                yield item
When I run the spider it doesn't throw any errors, but it outputs inconsistent results, sometimes scraping the same article page twice. I am thinking it might be something to do with the selectors I have used, but I can't narrow it down any further. Any help with this, please?
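In case it helps narrow things down, here is the variant I have been considering: splitting the crawl into two rules, one that only follows pagination links and one that hands the individual question pages to the callback, with the title/body extracted from the question page itself instead of looping over hrefs. The `allow` pattern for the pagination links is a placeholder guess on my part, since I haven't confirmed the site's exact URL scheme:

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from acer.items import AcerItem


    class AcercrawlerSpider(CrawlSpider):
        name = 'acercrawler'
        allowed_domains = ['studyacer.com']
        start_urls = ['http://www.studyacer.com/latest']

        rules = (
            # Follow pagination links only; no callback, so listing pages
            # are never scraped as if they were articles.
            # NOTE: the allow pattern is a guess at the URL scheme.
            Rule(LinkExtractor(allow=(r'/latest\?page=\d+',)), follow=True),
            # Hand individual question pages to parse_item; restrict_xpaths
            # limits link extraction to the listing-table cells.
            Rule(LinkExtractor(restrict_xpaths=('//td[@class="word-break"]',)),
                 callback='parse_item', follow=False),
        )

        def parse_item(self, response):
            # Runs once per question page, so each article yields one item.
            item = AcerItem()
            item['title'] = response.xpath('//h1/text()').extract()
            item['body'] = response.xpath('//div[@class="row-fluid"][2]//p/text()').extract()
            yield item

Is this the right direction, or is the problem somewhere else entirely?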