
I have been working on a project with Scrapy. With help from this lovely community I have managed to scrape the first page of this website: http://www.rotoworld.com/playernews/nfl/football-player-news?ls=roto%3anfl%3agnav. I am trying to scrape information from the "older" pages as well. I have researched CrawlSpider, rules, and link extractors, and believed I had the proper code. I want the spider to perform the same loop on subsequent pages. Unfortunately, when I run it, it just spits out the first page and doesn't continue to the "older" pages.

I am not exactly sure what I need to change and would really appreciate some help. There are posts going all the way back to February of 2004... I am new to data mining, and I'm not sure whether scraping every post is actually a realistic goal, but if it is, I would like to. Any help is appreciated. Thanks!

import scrapy
from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors import LinkExtractor



class Roto_News_Spider2(CrawlSpider):
    name = "RotoPlayerNews"

    start_urls = [
        'http://www.rotoworld.com/playernews/nfl/football/',
    ]

    Rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//input[@id="cp1_ctl00_btnNavigate1"]',)), callback="parse_page", follow= True),)


    def parse(self, response):
        for item in response.xpath("//div[@class='pb']"):
            player = item.xpath(".//div[@class='player']/a/text()").extract_first()
            position= item.xpath(".//div[@class='player']/text()").extract()[0].replace("-","").strip()
            team = item.xpath(".//div[@class='player']/a/text()").extract()[1].strip()
            report = item.xpath(".//div[@class='report']/p/text()").extract_first()
            date = item.xpath(".//div[@class='date']/text()").extract_first() + " 2018"
            impact = item.xpath(".//div[@class='impact']/text()").extract_first().strip()
            source = item.xpath(".//div[@class='source']/a/text()").extract_first()
            yield {"Player": player,"Position": position, "Team": team,"Report":report,"Impact":impact,"Date":date,"Source":source}

3 Answers


My suggestion: Selenium

If you want to change pages automatically, you can use Selenium WebDriver. Selenium lets you interact with the page: click buttons, type into inputs, and so on. You'll need to change your code to scrape the data and then click the "older" button; the page will change and you can keep scraping.

Selenium is a very useful tool. I'm using it right now on a personal project. You can take a look at my repo on GitHub to see how it works. In the case of the page you're trying to scrape, you cannot reach the older posts just by changing the URL, so you need Selenium to move between pages.
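
For example, a minimal sketch of letting Selenium turn the pages from inside a Scrapy spider might look like the following. This is only an outline, not code from my repo: the spider name is made up, it assumes Chrome/chromedriver is installed, and the button id is taken from the XPath in your own code.

import scrapy
from scrapy import Selector
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class RotoSeleniumSpider(scrapy.Spider):
    name = "roto_selenium"  # hypothetical name
    start_urls = ["http://www.rotoworld.com/playernews/nfl/football/"]

    def parse(self, response):
        driver = webdriver.Chrome()
        driver.get(response.url)
        wait = WebDriverWait(driver, 10)
        for _ in range(5):  # however many "older" pages you want to walk back
            # hand the rendered HTML back to Scrapy's selectors
            sel = Selector(text=driver.page_source)
            for item in sel.xpath("//div[@class='pb']"):
                yield {"Player": item.xpath(".//div[@class='player']/a/text()").extract_first()}
            # click "older" and wait until the current content goes stale
            older = wait.until(EC.element_to_be_clickable((By.ID, "cp1_ctl00_btnNavigate1")))
            older.click()
            wait.until(EC.staleness_of(older))
        driver.quit()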

Hope it helps.

reisdev
  • hmmm thanks for the quick response. I started with BeautifulSoup, then when I learned that I wouldn't be able to go to different links with that, I checked out Selenium. Someone suggested I check out Scrapy because it "could do what Selenium does" and more. Lol. So you are telling me there is no way with Scrapy to scrape the older pages? – Jordan Freundlich Jun 10 '18 at 19:51
  • It can do, but not always. I was trying to do it just with Scrapy, but, sometimes, `Selenium` works better, because it can wait for a tag to be visible, clickable and lots of things. – reisdev Jun 10 '18 at 19:52
  • You can use `Selenium` inside your spider; you'll need just a few modifications. If you take a look at my code, you'll see it. – reisdev Jun 10 '18 at 19:54
  • Alright cool. I'll check it out. If I have questions, can I ask you here? – Jordan Freundlich Jun 10 '18 at 20:01
  • For sure. Edit your post including your 'new' questions. – reisdev Jun 10 '18 at 20:03
  • word. What scraper in your github do you suggest checking out? – Jordan Freundlich Jun 10 '18 at 20:11
  • the zapimoveis spider. Focus on the `parse` method. It uses Selenium to change between pages. Take a look at my project's dependencies, also. – reisdev Jun 10 '18 at 20:18

No need to use Selenium in this case. Before scraping, open the URL in your browser and press F12 to inspect the page and watch the requests in the Network tab. When you press next ("OLDER" in your case) you can see a new set of requests in the Network tab. They provide everything you need. Once you understand how it works, you can write a working spider.

import scrapy
from scrapy import FormRequest
from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors import LinkExtractor



class Roto_News_Spider2(CrawlSpider):
    name = "RotoPlayerNews"

    start_urls = [
        'http://www.<DOMAIN>/playernews/nfl/football/',
    ]

    Rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//input[@id="cp1_ctl00_btnNavigate1"]',)), callback="parse", follow= True),)


    def parse(self, response):
        for item in response.xpath("//div[@class='pb']"):
            player = item.xpath(".//div[@class='player']/a/text()").extract_first()
            position= item.xpath(".//div[@class='player']/text()").extract()[0].replace("-","").strip()
            team = item.xpath(".//div[@class='player']/a/text()").extract()[1].strip()
            report = item.xpath(".//div[@class='report']/p/text()").extract_first()
            date = item.xpath(".//div[@class='date']/text()").extract_first() + " 2018"
            impact = item.xpath(".//div[@class='impact']/text()").extract_first().strip()
            source = item.xpath(".//div[@class='source']/a/text()").extract_first()
            yield {"Player": player,"Position": position, "Team": team,"Report":report,"Impact":impact,"Date":date,"Source":source}

        older = response.css('input#cp1_ctl00_btnNavigate1')
        if not older:
            return

        # The "Older" button triggers an ASP.NET postback, so collect the hidden
        # state fields plus the visible filter inputs that the form would submit.
        inputs = response.css('div.aspNetHidden input')
        inputs.extend(response.css('div.RW_pn input'))

        formdata = {}
        for input in inputs:
            name = input.css('::attr(name)').extract_first()
            value = input.css('::attr(value)').extract_first()
            formdata[name] = value or ''

        # Simulate a click on the "Older" image button: send its x/y click
        # coordinates and drop the submit buttons we are not pressing.
        formdata['ctl00$cp1$ctl00$btnNavigate1.x'] = '42'
        formdata['ctl00$cp1$ctl00$btnNavigate1.y'] = '17'
        del formdata['ctl00$cp1$ctl00$btnFilterResults']
        del formdata['ctl00$cp1$ctl00$btnNavigate1']

        action_url = 'http://www.<DOMAIN>/playernews/nfl/football-player-news?ls=roto%3anfl%3agnav&rw=1'

        # POST the form back and parse the "older" page with the same callback.
        yield FormRequest(
            action_url,
            formdata=formdata,
            callback=self.parse
        )

Be careful: you need to replace every <DOMAIN> in my code with the correct domain.
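
To make the formdata part more concrete, the dict that gets POSTed ends up looking roughly like a standard ASP.NET WebForms postback payload. The field names below are the usual ASP.NET state fields and the values are placeholders, not captured from the site; the real names and values are whatever the hidden inputs on the page contain:

formdata = {
    '__VIEWSTATE': '...',           # ASP.NET page state copied from the hidden input
    '__VIEWSTATEGENERATOR': '...',  # further hidden state fields the server expects back
    '__EVENTVALIDATION': '...',
    # ...plus the filter inputs from div.RW_pn, and finally the click coordinates
    # that tell the server the "Older" image button was pressed:
    'ctl00$cp1$ctl00$btnNavigate1.x': '42',
    'ctl00$cp1$ctl00$btnNavigate1.y': '17',
}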

Oleg T.
  • Jordan Freundlich, I didn't test how it works after page number 5. I do not know how it will work when 'ctl00$cp1$ctl00$hidPageLastLine' equals zero. – Oleg T. Jun 10 '18 at 21:56
  • Hey thanks for the help! Mind explaining the formdata part? – Jordan Freundlich Jun 11 '18 at 00:02
  • yea this really confused me haha. Can anyone help figure this out? – Jordan Freundlich Jun 11 '18 at 00:49
  • Formdata is a dict with the fields to send in the POST request. If you explore how the website works by pressing F12 in the browser and going to the Network tab, you will understand it. – Oleg T. Jun 11 '18 at 04:10
  • I ran the code as it is. There was an error with the start_urls, so I took out the "". Then the code ran fine, but only put out the first page's information. I have looked at the Network tab, and when I click "older" a million things pop up; I'm not exactly sure what I am supposed to do with that. I will try to research "formdata" and the Network tab throughout the day while at work. – Jordan Freundlich Jun 11 '18 at 14:32
  • Did you change it in both places? Did you change it for the action_url variable? – Oleg T. Jun 11 '18 at 14:38

If your intention is to fetch the data by traversing multiple pages, you don't need to go for Scrapy at all. If you still want a Scrapy-related solution, then I suggest you opt for Splash to handle the pagination; a rough sketch of that idea is below.
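
For reference, a Splash-based version could look roughly like the following. Treat it as a sketch only: it assumes you have a Splash instance running and scrapy-splash configured in settings.py as its README describes, the spider name is made up, the button id is taken from the question, and I have not run it against this site.

import scrapy
from scrapy_splash import SplashRequest

# Lua script executed inside Splash: load the page, click "Older", return the new HTML
CLICK_OLDER = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(1)
    splash:select('#cp1_ctl00_btnNavigate1'):mouse_click()
    splash:wait(1)
    return {html = splash:html()}
end
"""


class RotoSplashSpider(scrapy.Spider):
    name = "roto_splash"  # hypothetical name

    def start_requests(self):
        yield SplashRequest(
            'http://www.rotoworld.com/playernews/nfl/football/',
            callback=self.parse,
            endpoint='execute',
            args={'lua_source': CLICK_OLDER},
        )

    def parse(self, response):
        # With scrapy-splash's default magic response handling, the 'html' the
        # script returns is what the selectors below run against.
        for item in response.xpath("//div[@class='pb']"):
            yield {"Player": item.xpath(".//div[@class='player']/a/text()").extract_first()}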

That said, I would do something like below to get the items with Selenium (assuming you have already installed Selenium on your machine):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.rotoworld.com/playernews/nfl/football/")
wait = WebDriverWait(driver, 10)

while True:
    # scrape every news block on the current page
    for item in wait.until(EC.presence_of_all_elements_located((By.XPATH,"//div[@class='pb']"))):
        player = item.find_element_by_xpath(".//div[@class='player']/a").text
        player = player.encode() #it should handle the encoding issue; I'm not totally sure, though
        print(player)

    try:
        # check the date shown on the page and stop once it is old enough
        idate = wait.until(EC.presence_of_element_located((By.XPATH, "//div[@class='date']"))).text
        if "Jun 9" in idate: #put here any date you wanna go back to (last limit: where the scraper will stop)
            break
        # click "older", then wait until the old content goes stale (the page has changed)
        wait.until(EC.presence_of_element_located((By.XPATH, "//input[@id='cp1_ctl00_btnNavigate1']"))).click()
        wait.until(EC.staleness_of(item))
    except: break

driver.quit()
SIM
  • Hey SIM, thanks for checking back in, and helping me... again.... :) I just got a new laptop so I have to re-install Selenium, but will try to do that while at work today. I'll give this code a go once I do, and let you know what happens. I assume I can just put in all the items I want, and have them printed out? Do I have to do anything after I tell the webdriver to click the next page? Or will it return to the while True statement again? Thanks for the help! – Jordan Freundlich Jun 11 '18 at 14:29
  • You don't have to do anything other than running the script. When you run it you can see that it will keep clicking on the `older button` until there is nothing left and provide you with the names from each page. – SIM Jun 11 '18 at 15:15
  • So this works! I am in awe at how cool this is... Thanks so much SIM! I have a few more questions. After changing the xpath syntax of my original items, they all work except for the team and position. Any idea how I can fix those to also work with Selenium? Is there a way to tell the script to stop at a certain point? There is data going all the way back to February of 2004 that I want, but you can keep going further back with "older" pages even though there is nothing to scrape. – Jordan Freundlich Jun 11 '18 at 23:32
  • I figured out the team and position scripts. – Jordan Freundlich Jun 12 '18 at 03:38
  • O.K. so I am running into a new problem, @SIM. I have tried to run the code twice, and it gets going. When it gets to October 7th of 2016, it spits out an error. The error reads, "UnicodeEncodeError: 'charmap' codec can't encode character '\u2009' in position 322: character maps to <undefined>". Any ideas how to get around this? – Jordan Freundlich Jun 12 '18 at 19:10
  • Ok, I will check what's wrong with it. I forgot to go through it; I wished to, though. – SIM Jun 12 '18 at 19:39
  • Check out the edit. I have put there a limit for when to stop the crawler. Just put there any preferable date to break on (make sure the date is available on that page). Also I tried encoding the print statement to bypass any issue. Let me know the update. – SIM Jun 12 '18 at 19:58
  • When I add the player = player.encode() and try to run it it errors: f.write(player + "," + position + "," + team + "," + report.replace(",","|") + "," + impact.replace(",", "|") + "," + date + "," + source + "\n") TypeError: can't concat str to bytes – Jordan Freundlich Jun 12 '18 at 22:31
  • In case of writing the structure is different. Check out [this link](https://stackoverflow.com/questions/934160/write-to-utf-8-file-in-python) to get the clarity. Btw, does the scraper stop now as intended? – SIM Jun 12 '18 at 22:39
  • Something like this `f = open(filename, "w", encoding="utf-8")` and when you do so, take out this portion `player = player.encode()`. – SIM Jun 12 '18 at 22:46
  • I have not tried stopping it yet. I'm just letting it run to see if it gives me the error at the same place. Is there a way to use the idate part to tell it where to start? – Jordan Freundlich Jun 13 '18 at 03:27
  • The script finished running and I got all 329,000 containers! Thanks so much for the help man. I am really excited to learn how to clean the data up and work with this data. I am looking into the best ways of doing so, some SQL server. Got any suggestions? – Jordan Freundlich Jun 14 '18 at 23:52