I am trying to scrape the results from the following page:
http://www.peekyou.com/work/autodesk/page=1
with page = 1, 2, 3, 4 and so on, depending on the number of results. I have a PHP file that runs the crawler once for each page number. The code (for a single page) is as follows:
`import sys
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item
from scrapy.http import Request
#from scrapy.crawler import CrawlerProcess

class DmozSpider(BaseSpider):
    name = "peekyou_crawler"
    start_urls = ["http://www.peekyou.com/work/autodesk/page=1"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Look for the "Next" pagination link
        discovery = hxs.select('//div[@class="nextPage"]/table/tr[2]/td/a[contains(@title,"Next")]')
        print len(discovery)
        print "Starting the actual file"
        # Each result entry is a div with class "resultCell"
        items = hxs.select('//div[@class="resultCell"]')
        count = 0
        for newsItem in items:
            print newsItem
            url = newsItem.select('h2/a/@href').extract()
            name = newsItem.select('h2/a/span/text()').extract()
            count = count + 1
            print count
            print url[0]
            print name[0]
            print "\n"
`
The Autodesk results span 18 pages. When I run the code to crawl all of them, the crawler only gets data from page 2 rather than from every page. I also tried changing the company name; again, it scrapes some pages and skips the rest, even though I get an HTTP 200 response for every page. If I run it repeatedly, it always scrapes the same subset of pages, never all of them. Any idea what could be wrong with my approach, or what I am missing?
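For context, the page URLs that the PHP wrapper passes to the crawler could also be built in Python and assigned to `start_urls`, so that a single run covers every page. This is only a minimal sketch, assuming the `page=N` pattern holds for every company and that the page count (18 for Autodesk) is known beforehand; `build_page_urls` is a hypothetical helper name, not part of Scrapy:

```python
def build_page_urls(company, last_page):
    """Return the result-page URLs for pages 1..last_page (hypothetical helper)."""
    # Assumption: every company uses the same URL pattern as the Autodesk example.
    template = "http://www.peekyou.com/work/{company}/page={page}"
    return [template.format(company=company, page=n)
            for n in range(1, last_page + 1)]


# Assumption: the Autodesk listing has 18 pages, as observed on the site.
autodesk_urls = build_page_urls("autodesk", 18)
```

The resulting list could then replace the single entry in `start_urls`, which would make it easier to compare the pages requested with the pages that actually yield results.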
Thanks in advance.