I am trying to write a Scrapy spider that crawls through all the results pages on the domain: https://www.ghcjobs.apply2jobs.com.... The code should do three things:
(1) Crawl through all of the pages 1-1000. These pages are identical except for the final portion of the URL, &CurrentPage=#. (One alternative I have been weighing for this step is building those page URLs directly rather than extracting them from the page; a rough sketch follows this list.)
(2) Follow each link inside the results table containing job postings where the link's class = SearchResult. These are the only links within the table, so I am not in any trouble here.
(3) Store the information shown on the job description page in key:value JSON format. (This part works, in a rudimentary fashion)
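For step (1), the alternative I mentioned would be to enumerate the thousand page URLs up front instead of discovering the pagination links. Here is a rough, untested sketch of that idea; it assumes each page really is reachable at &CurrentPage=N, and it uses the plain Spider base class from the same older Scrapy release as the scrapy.contrib imports below:

from scrapy.spider import Spider  # plain Spider is enough for this sketch; no CrawlSpider rules needed

class GenesisPagesSketch(Spider):
    name = "genesis_pages_sketch"
    # Enumerate every results page up front instead of relying on link extraction.
    start_urls = [
        "https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm"
        "?fuseaction=mExternal.returnToResults&CurrentPage=%d" % page
        for page in range(1, 1001)
    ]

    def parse(self, response):
        # Placeholder: this is where the class="SearchResult" job links would be followed.
        self.log("Reached results page: %s" % response.url)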
I have worked with Scrapy and CrawlSpiders before, using the rules = [Rule(LinkExtractor(allow=...), ...)] approach to recursively parse a page and find all the links that match a given regex pattern. I am currently stumped on step (1), crawling through the thousand result pages.
Below is my spider code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http.request import Request
from scrapy.contrib.linkextractors import LinkExtractor
from genesisSpider.items import GenesisJob
class genesis_crawl_spider(CrawlSpider):
    name = "genesis"
    #allowed_domains = ['http://www.ghcjobs.apply2jobs.com']
    start_urls = ['https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1']

    # Allow &CurrentPage= up to 1000; the board currently has ~512 pages.
    rules = [Rule(LinkExtractor(allow=("^https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm\?fuseaction=mExternal.returnToResults&CurrentPage=[1-1000]$")), 'parse_inner_page')]

    def parse_inner_page(self, response):
        self.log('===========Entered Inner Page============')
        self.log(response.url)
        item = GenesisJob()
        item['url'] = response.url
        yield item
Here is the output of the spider, with a bit of the startup logging at the top cut off:
2014-09-02 16:02:48-0400 [genesis] DEBUG: Crawled (200) <GET https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1> (referer: None) ['partial']
2014-09-02 16:02:48-0400 [genesis] DEBUG: Crawled (200) <GET https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?CurrentPage=1&fuseaction=mExternal.returnToResults> (referer: https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1) ['partial']
2014-09-02 16:02:48-0400 [genesis] DEBUG: ===========Entered Inner Page============
2014-09-02 16:02:48-0400 [genesis] DEBUG: https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?CurrentPage=1&fuseaction=mExternal.returnToResults
2014-09-02 16:02:48-0400 [genesis] DEBUG: Scraped from <200 https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?CurrentPage=1&fuseaction=mExternal.returnToResults>
    {'url': 'https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?CurrentPage=1&fuseaction=mExternal.returnToResults'}
2014-09-02 16:02:48-0400 [genesis] INFO: Closing spider (finished)
2014-09-02 16:02:48-0400 [genesis] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 930,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 92680,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 9, 2, 20, 2, 48, 611000),
'item_scraped_count': 1,
'log_count/DEBUG': 7,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2014, 9, 2, 20, 2, 48, 67000)}
2014-09-02 16:02:48-0400 [genesis] INFO: Spider closed (finished)
Currently, I am stuck on objective (1) of this project. As you can see, my spider only crawls the start_urls page. My regex should be targeting the page-navigation buttons correctly, as I have tested the regex on its own. My callback function, parse_inner_page, is working, as the debug message I inserted shows, but only on the first page. Am I using 'Rule' incorrectly? I was thinking that maybe the page being HTTPS was somehow to blame...
Just as a way of tinkering toward a solution, I tried issuing a manual request for the second page of results; this didn't work either. Here is the code for that too.
Request("https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=2", callback = 'parse_inner_page')
Can anyone offer any guidance? Is there maybe a better way to do this? I have been researching this on SO and in the Scrapy documentation since Friday. Thank you so much.
UPDATE: I have resolved the issue. The problem was with the start URL I was using.
start_urls = ['https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1']
This URL leads to a post-form-submission page, the result of clicking the "Search" button on the site's search page. That click runs JavaScript on the client side to submit a form to the server, which then returns the full job board, pages 1-512. However, there is another hard-coded URL that apparently queries the server without needing any client-side JavaScript. So now my start URL is
start_urls = ['https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.searchJobs']
And everything is back on track! In the future, check whether there are any JavaScript-independent URLs for calling server resources.
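For completeness, here is roughly what the working spider looks like now. This is a sketch rather than my exact file: the allow pattern uses CurrentPage=\d+ on the assumption that any page number should match (a character class like [1-1000] only ever matches a single character), and follow=True keeps the pagination links being followed from every results page.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from genesisSpider.items import GenesisJob

class genesis_crawl_spider(CrawlSpider):
    name = "genesis"
    # This entry point returns the full job board without any client-side JavaScript.
    start_urls = ['https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.searchJobs']

    rules = [
        # Follow every paginated results page, whatever order the query parameters come back in.
        Rule(LinkExtractor(allow=(r'CurrentPage=\d+',)), callback='parse_inner_page', follow=True),
    ]

    def parse_inner_page(self, response):
        item = GenesisJob()
        item['url'] = response.url
        yield item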