
I am new to Scrapy. I am scraping a job site that lists positions; when you click on a position, a new page opens containing the data I need to fetch.

For example the page contains a table with the following format,

     Job Title                    Full/Part Time    Location/Affiliates
1.   Accountant                   Full Time         Mount Sinai Medical Center (Manhattan)
2.   Accountant                   Full Time         Mount Sinai Medical Center (Manhattan)
3.   Admin Assistant              Full Time         Mount Sinai Hospital (Queens)
4.   Administrative Assistant     Full Time         Mount Sinai Medical Center (Manhattan)


Page:  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 

All the job titles above are JavaScript-generated links. I need to submit each of them as a form with the appropriate values (found using Firebug). How can I submit multiple forms at a time, or how can I write one method that loops through all the job-title links so I can fetch the data from each one?

I also need to paginate through all the pages listed above. When I click on page 2, a page opens with the same table format but different job positions, and so on. How can I paginate through those pages in Scrapy?

My code is:

from scrapy.spider import BaseSpider
from scrapy.http import FormRequest

class MountSinaiSpider(BaseSpider):
   name = "mountsinai"
   allowed_domains = ["mountsinaicss.igreentree.com"]
   start_urls = [
       "https://mountsinaicss.igreentree.com/css_external/CSSPage_SearchAndBrowseJobs.ASP?T=20120517011617&",
   ]

   # Submits the starting page's form with the values needed to click "Find Jobs"
   def parse(self, response):
       return [FormRequest.from_response(response,
                                        formdata={ "Type":"CSS","SRCH":"Search Jobs","InitURL":"CSSPage_SearchAndBrowseJobs.ASP","RetColsQS":"Requisition.Key¤Requisition.JobTitle¤Requisition.fk_Code_Full_Part¤[Requisition.fk_Code_Full_Part]OLD.Description(sysfk_Code_Full_PartDesc)¤Requisition.fk_Code_Location¤[Requisition.fk_Code_Location]OLD.Description(sysfk_Code_LocationDesc)¤Requisition.fk_Code_Dept¤[Requisition.fk_Code_Dept]OLD.Description(sysfk_Code_DeptDesc)¤Requisition.Req¤","RetColsGR":"Requisition.Key¤Requisition.JobTitle¤Requisition.fk_Code_Full_Part¤[Requisition.fk_Code_Full_Part]OLD.Description(sysfk_Code_Full_PartDesc)¤Requisition.fk_Code_Location¤[Requisition.fk_Code_Location]OLD.Description(sysfk_Code_LocationDesc)¤Requisition.fk_Code_Dept¤[Requisition.fk_Code_Dept]OLD.Description(sysfk_Code_DeptDesc)¤Requisition.Req¤","ResultSort":"" },
                                        callback=self.parse_main_list)]

   def parse_main_list(self, response):
       return [FormRequest.from_response(response,
                                         formdata={ "Key":"","Type":"CSS","InitPage":"1","InitURL":"CSSPage_SearchAndBrowseJobs.ASP","SRCH":"Search Jobs","Search":"ISNULL(Requisition.DatePostedExternal, '12/31/9999')¤BETWEEN 1/1/1753 AND Today¥","RetColsQS":"Requisition.Key¤Requisition.JobTitle¤Requisition.fk_Code_Full_Part¤[Requisition.fk_Code_Full_Part]OLD.Description(sysfk_Code_Full_PartDesc)¤Requisition.fk_Code_Location¤[Requisition.fk_Code_Location]OLD.Description(sysfk_Code_LocationDesc)¤Requisition.fk_Code_Dept¤[Requisition.fk_Code_Dept]OLD.Description(sysfk_Code_DeptDesc)¤Requisition.Req¤","RetColsGR":"Requisition.Key¤Requisition.JobTitle¤Requisition.fk_Code_Full_Part¤[Requisition.fk_Code_Full_Part]OLD.Description(sysfk_Code_Full_PartDesc)¤Requisition.fk_Code_Location¤[Requisition.fk_Code_Location]OLD.Description(sysfk_Code_LocationDesc)¤Requisition.fk_Code_Dept¤[Requisition.fk_Code_Dept]OLD.Description(sysfk_Code_DeptDesc)¤Requisition.Req¤","ResultSort":"[sysfk_Code_Full_PartDesc]" },
                                        dont_click = True,
                                        callback=self.parse_fir_pag_urls)]


   def parse_fir_pag_urls(self, response):
       print response
Shiva Krishna Bavandla

1 Answer

The key function is your callback, for example the parse method. It is called when a page from start_urls has been downloaded, and the response for that page is passed to parse as a parameter.

In the parse method you analyze (parse) the page, usually using HtmlXPathSelector, collect the data you need from that page, and put it into an Item. Once you have collected everything you need, you yield that item; Scrapy detects that it's an item and passes it to the pipelines.

If the page you are parsing doesn't contain any data (for example, it is a category page), or contains only part of the data you need, and you find on it a link to other page(s) with the additional data, then instead of yielding an item you yield a Request instance with the URL of the other page and another callback.

FormRequest is a subclass of Request, so you can yield as many FormRequests from a parse method as you need.

When you finally reach the page you need, you extract the data in the corresponding callback (using HtmlXPathSelector) and yield an Item instance from the method.

warvariuc