I have been working on a project with Scrapy. With help from this lovely community, I have managed to scrape the first page of this website: http://www.rotoworld.com/playernews/nfl/football-player-news?ls=roto%3anfl%3agnav. I am now trying to scrape the "older" pages as well. I have read up on CrawlSpider, rules, and link extractors, and believed I had the proper code. I want the spider to run the same loop on each subsequent page. Unfortunately, when I run it, it only outputs the first page and does not continue to the "older" pages.
I am not sure what I need to change and would really appreciate some help. There are posts going all the way back to February 2004. I am new to data mining and not sure whether scraping every post is a realistic goal, but if it is, I would like to. Any help is appreciated. Thanks!
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class Roto_News_Spider2(CrawlSpider):
    name = "RotoPlayerNews"
    start_urls = [
        'http://www.rotoworld.com/playernews/nfl/football/',
    ]

    # Intended to follow the "older" button to each subsequent page
    Rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//input[@id="cp1_ctl00_btnNavigate1"]',)), callback="parse_page", follow=True),)

    def parse(self, response):
        # Each news item on the page sits in its own <div class="pb">
        for item in response.xpath("//div[@class='pb']"):
            player = item.xpath(".//div[@class='player']/a/text()").extract_first()
            position = item.xpath(".//div[@class='player']/text()").extract()[0].replace("-", "").strip()
            team = item.xpath(".//div[@class='player']/a/text()").extract()[1].strip()
            report = item.xpath(".//div[@class='report']/p/text()").extract_first()
            date = item.xpath(".//div[@class='date']/text()").extract_first() + " 2018"
            impact = item.xpath(".//div[@class='impact']/text()").extract_first().strip()
            source = item.xpath(".//div[@class='source']/a/text()").extract_first()
            yield {"Player": player, "Position": position, "Team": team, "Report": report, "Impact": impact, "Date": date, "Source": source}
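One detail that may matter here: Python attribute names are case-sensitive, and Scrapy's CrawlSpider documents its rule list under the lowercase name `rules`, so an attribute spelled `Rules` might simply never be read. The sketch below illustrates only that Python behaviour, using a hypothetical stand-in base class (`StandInCrawlSpider` and `rule_count` are made up for illustration and are not part of Scrapy).

```python
class StandInCrawlSpider:
    """Hypothetical stand-in for a crawling base class; NOT Scrapy's real CrawlSpider."""

    rules = ()  # the base class only ever reads the lowercase name

    def rule_count(self):
        # Looks up `rules` on the instance, falling back to the class hierarchy
        return len(self.rules)


class WithCapitalRules(StandInCrawlSpider):
    # Creates a new, unrelated attribute; the inherited `rules` stays empty
    Rules = ("follow-older-pages",)


class WithLowercaseRules(StandInCrawlSpider):
    # Overrides the attribute the base class actually reads
    rules = ("follow-older-pages",)


print(WithCapitalRules().rule_count())
print(WithLowercaseRules().rule_count())
```

The Scrapy documentation also warns against using `parse` as a callback name in a CrawlSpider, because CrawlSpider uses `parse` itself to apply its rules. Whether either point explains the behaviour described above is an assumption that has not been verified against this site.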