
I am trying to write a Scrapy spider that crawls through all the results pages on the domain https://www.ghcjobs.apply2jobs.com.... The code should do three things:

(1) Crawl through all the pages 1-1000. These pages are identical, save for being differentiated by the final portion of the URL: &CurrentPage=#.

(2) Follow each link inside the results table containing job postings where the link's class = SearchResult. These are the only links within the table, so I am not in any trouble here.

(3) Store the information shown on the job description page in key:value JSON format. (This part works, in a rudimentary fashion)

I have worked with Scrapy and CrawlSpiders before, using the 'rules = [Rule(LinkExtractor(allow=' method of recursively parsing a page to find all the links that match a given regex pattern. I am currently stumped on step 1, crawling through the thousand result pages.
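For reference, the generic shape of that approach as I have used it before is roughly the following (the regex and callback name here are only illustrative, not tuned to this site):

rules = [Rule(
    LinkExtractor(allow=(r"SomePattern=\d+",)),   # regex for the links to follow
    callback='parse_item',                        # method that parses each matched page
    follow=True,                                  # keep extracting links from followed pages
)]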

Below is my spider code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http.request import Request
from scrapy.contrib.linkextractors import LinkExtractor
from genesisSpider.items import GenesisJob

class genesis_crawl_spider(CrawlSpider):
    name = "genesis"
    #allowed_domains = ['http://www.ghcjobs.apply2jobs.com']
    start_urls = ['https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1']

    #allow &CurrentPage= up to 1000, currently ~ 512
    rules = [Rule(LinkExtractor(allow=("^https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm\?fuseaction=mExternal.returnToResults&CurrentPage=[1-1000]$")), 'parse_inner_page')]

    def parse_inner_page(self, response):
        self.log('===========Entrered Inner Page============')
        self.log(response.url)
        item = GenesisJob()
        item['url'] = response.url

        yield item

Here is the output of the spider, with a bit of the execution code on top cut off:

2014-09-02 16:02:48-0400 [genesis] DEBUG: Crawled (200) <GET https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1> (referer: None) ['partial']
2014-09-02 16:02:48-0400 [genesis] DEBUG: Crawled (200) <GET https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?CurrentPage=1&fuseaction=mExternal.returnToResults> (referer: https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1) ['partial']
2014-09-02 16:02:48-0400 [genesis] DEBUG: ===========Entrered Inner Page============
2014-09-02 16:02:48-0400 [genesis] DEBUG: https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?CurrentPage=1&fuseaction=mExternal.returnToResults
2014-09-02 16:02:48-0400 [genesis] DEBUG: Scraped from <200 https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?CurrentPage=1&fuseaction=mExternal.returnToResults>
        {'url': 'https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?CurrentPage=1&fuseaction=mExternal.returnToResults'}
2014-09-02 16:02:48-0400 [genesis] INFO: Closing spider (finished)
2014-09-02 16:02:48-0400 [genesis] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 930,
         'downloader/request_count': 2,
         'downloader/request_method_count/GET': 2,
         'downloader/response_bytes': 92680,
         'downloader/response_count': 2,
         'downloader/response_status_count/200': 2,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2014, 9, 2, 20, 2, 48, 611000),
         'item_scraped_count': 1,
         'log_count/DEBUG': 7,
         'log_count/INFO': 7,
         'request_depth_max': 1,
         'response_received_count': 2,
         'scheduler/dequeued': 2,
         'scheduler/dequeued/memory': 2,
         'scheduler/enqueued': 2,
         'scheduler/enqueued/memory': 2,
         'start_time': datetime.datetime(2014, 9, 2, 20, 2, 48, 67000)}
2014-09-02 16:02:48-0400 [genesis] INFO: Spider closed (finished)

Currently, I am stuck on objective (1) of this project. As you can see, my spider only crawls through the start_url page. My regex should be targeting the page navigation buttons correctly as I have tested the regex. My callback function, parse_inner_page, is working, as is shown by the debugging comment I inserted, but only on the first page. Am I using 'Rule' incorrectly? I was thinking that maybe the page being HTTPS was somehow to blame...

Just as a way to tinker with a solution, I tried issuing a manual request for the second page of results; this didn't work. Here is the code for that too.

Request("https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=2",  callback = 'parse_inner_page')

Can anyone offer any guidance? Is there maybe a better way to do this? I have been researching this on SO / Scrapy documentation since Friday. Thank you so much.

UPDATE: I have resolved the issue. The problem was with the start url I was using.

start_urls = ['https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1'] 

Leads to a post-form-submission page that is the result of clicking the "search" button on this page. That page runs JavaScript on the client side to submit a form to the server, which returns the full job board, pages 1-512. However, there exists another hard-coded URL which apparently calls the server without needing any client-side JavaScript. So now my start URL is

start_urls = ['https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.searchJobs']

And everything is back on track! In the future, check whether there are any JavaScript-independent URLs for calling server resources.
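Putting the pieces together, the top of the spider now looks roughly like this (the rule regex is simplified here for illustration; follow=True keeps the spider paging through the numbered navigation links):

class genesis_crawl_spider(CrawlSpider):
    name = "genesis"
    start_urls = ['https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.searchJobs']
    rules = [Rule(LinkExtractor(allow=(r"fuseaction=mExternal\.returnToResults&CurrentPage=\d+",)),
                  callback='parse_inner_page',
                  follow=True)]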

deusofnull
1 Answer


Are you sure Scrapy sees the web page the same way you do? Nowadays, more and more sites are built with JavaScript and Ajax, and that dynamic content may need a fully functional browser to be fully populated. Neither Nutch nor Scrapy handles this out of the box.

First of all, you need to make sure the web content you are interested in can be retrieved by Scrapy. There are a few ways to do it; I usually use urllib2 and beautifulsoup4 to give it a quick try. And your start page failed my test.

$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> url = "https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1"

>>> html = urllib2.urlopen(url).read()
>>> soup = BeautifulSoup(html)
>>> table = soup.find('div', {'id':'VESearchResults'})
>>> table.text
u'\n\n\n\r\n\t\t\tJob Title\xa0\r\n\t\t\t\r\n\t\t\n\r\n\t\t\tArea of Interest\xa0\r\n\t\t\t\r\n\t\t\n\r\n\t\t\tLocation\xa0\r\n\t\t\t\r\n\t\t\n\r\n\t\t\tState\xa0\r\n\t\t\t\r\n\t\t\n\r\n\t\t\tCity\xa0\r\n\t\t\t\r\n\t\t\n\n\n\r\n\t\t\t\t\tNo results matching your criteria.\r\n\t\t\t\t\n\n\n'
>>> 

As you can see, the table only contains "No results matching your criteria." I think you need to figure out why the content is not populated. Cookies? POST instead of GET? User agent, etc.
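If it turns out the table is filled in by a form submission (POST), one way to reproduce that in Scrapy is FormRequest.from_response, roughly like this; the form field and method names below are placeholders, so inspect the real form to fill them in:

from scrapy.http import FormRequest

def parse_search_page(self, response):
    # re-submit the page's search form server-side; 'CurrentPage' is a guessed field name
    yield FormRequest.from_response(response,
                                    formdata={'CurrentPage': '1'},
                                    callback=self.parse_results_page)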

Also, you can use the scrapy parse command to help you debug. For example, I use this command quite often.

scrapy parse http://example.com --rules
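Applied to your spider, that would look something like this (--spider selects the spider by name so its rules are the ones being tested):

scrapy parse --spider=genesis --rules "https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1"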

A few other Scrapy commands, and maybe Selenium, might also be helpful down the road.

Here I am running scrapy shell in IPython to inspect your start URL: the first record that I can see in my browser contains "Englewood", and it does not exist in the HTML that Scrapy grabbed.
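Reconstructed roughly, the check in scrapy shell was along these lines (output abbreviated):

$ scrapy shell "https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1"
...
In [1]: 'Englewood' in response.body
Out[1]: False

In [2]: response.xpath('//div[@id="VESearchResults"]//text()').extract()
Out[2]: [u'...', u'No results matching your criteria.', u'...']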

Update:

What you are doing is really trivial scraping work and you don't really need Scrapy; it is a bit overkill. Here are my suggestions:

  1. Take a look at Selenium (I am assuming you write Python) and run it headless in the end when you try to run it on a server.
  2. You can implement this with PhantomJS, which is a much lighter JavaScript executor to get your work done; there is a short sketch of this route after the list. Here is another Stack Overflow question that might be helpful.
  3. There are several other resources out there that you can dig into.
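A minimal sketch of the PhantomJS route, assuming the phantomjs binary is on your PATH (the element id comes from the inspection above; everything else is illustrative):

from selenium import webdriver

driver = webdriver.PhantomJS()          # headless JavaScript-capable browser
driver.get("https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1")
driver.implicitly_wait(10)              # give the client-side JS time to populate the table
table = driver.find_element_by_id("VESearchResults")
print table.text                        # should now contain the job rows
driver.quit()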
B.Mr.W.
  • That is very strange that "Englewood" is not showing up in the soup. I found it in the HTML table at line 839 of the page source. Regarding your suspicion of Ajax calls: I looked at the page console in Google Chrome with a co-worker and didn't see any Ajax calls... What could be causing the HTML to be invisible to scrapy / beautifulsoup? – deusofnull Sep 03 '14 at 14:49
  • @deusofnull That is the question, and that is where the effort should be put. You cannot just inspect an element in the browser and assume whatever you see there is supposed to show up in Scrapy. Maybe clear your cache and cookies and disable JavaScript; then I assume whatever you see in the browser should be similar to what Scrapy sees. I took a quick look and it seems like it is not that easy. – B.Mr.W. Sep 03 '14 at 15:00
  • Oh boy, well that explains a lot. I looked at the page without JS enabled and, not surprisingly, the table was empty... I found an add-on to Scrapy called scrapyjs on GitHub. Hopefully it may help? Otherwise, are you familiar with any route I could take to add JavaScript functionality to Scrapy? – deusofnull Sep 03 '14 at 15:37
  • @deusofnull See the update, hope it is helpful. Sometimes you can carefully monitor the network tab to see which Ajax call populates the content and avoid using Selenium or PhantomJS... but in the worst case, Selenium always works. – B.Mr.W. Sep 03 '14 at 15:54
  • That is very helpful, thank you! I actually just found what seems to be a hard-coded URL for the populated search table, which is obeying my rules properly now! I will update my question with this new information as I get further along with it! My company actually uses CasperJS/PhantomJS for spiders, but in some instances we have seen really bad performance. This was one such instance, and the reason why I am designing a Python spider. – deusofnull Sep 03 '14 at 18:18
  • Thanks for your help! Sharing the 'see-the-page-like-how-the-spider-sees-the-page' approach led me down the right path! Salut! – deusofnull Sep 03 '14 at 21:00