
I am trying to scrape the results of the following page:

http://www.peekyou.com/work/autodesk/page=1

with page = 1, 2, 3, 4 ... and so on, as the results require. I am using a PHP script to launch the crawler for different page numbers. The code (for a single page) is as follows:

import sys

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item
from scrapy.http import Request
#from scrapy.crawler import CrawlerProcess

class DmozSpider(BaseSpider):
    name = "peekyou_crawler"

    start_urls = ["http://www.peekyou.com/work/autodesk/page=1"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        # check whether a "Next" link exists on the page
        discovery = hxs.select('//div[@class="nextPage"]/table/tr[2]/td/a[contains(@title,"Next")]')
        print len(discovery)

        print "Starting the actual file"
        items = hxs.select('//div[@class="resultCell"]')
        count = 0
        for newsItem in items:
            print newsItem

            url = newsItem.select('h2/a/@href').extract()
            name = newsItem.select('h2/a/span/text()').extract()
            count = count + 1
            print count
            print url[0]
            print name[0]
            print "\n"
The Autodesk results have 18 pages. When I run the code to crawl all the pages, the crawler only gets data from page 2, not from all of them. Similarly, when I change the company name to something else, it again scrapes some pages and skips the rest, even though I get an HTTP 200 response for every page. Also, no matter how many times I rerun it, it keeps scraping the same pages, but never all of them. Any idea what the error in my approach could be, or what I am missing?

Thanks in advance.


2 Answers


You can add more addresses:

start_urls = [
    "http://www.peekyou.com/work/autodesk/page=1",
    "http://www.peekyou.com/work/autodesk/page=2",
    "http://www.peekyou.com/work/autodesk/page=3"
]

You can generate more addresses:

start_urls = [
    "http://www.peekyou.com/work/autodesk/page=%d" % i for i in xrange(1, 19)
]

I think you should read about start_requests() and how to generate the next url. But I can't help you much there, because I don't use Scrapy; I still use pure Python (and pyQuery) to create simple crawlers ;)
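A minimal sketch of that idea, assuming the same 18-page URL scheme as above (untested, since I don't use Scrapy myself):

from scrapy.spider import BaseSpider
from scrapy.http import Request


class PagedSpider(BaseSpider):
    name = "peekyou_pages"

    def start_requests(self):
        # yield one request per results page instead of hardcoding start_urls
        for i in xrange(1, 19):
            url = "http://www.peekyou.com/work/autodesk/page=%d" % i
            yield Request(url, callback=self.parse)

    def parse(self, response):
        pass  # extract items here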

PS. Sometimes servers check your User-Agent, your IP, or how fast you grab the next page, and stop sending pages to you.
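If that is the case here, slowing down and sending a browser-like User-Agent may help; for example, in a Scrapy project's settings.py (the exact values below are just placeholders):

# settings.py
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'  # placeholder UA string
DOWNLOAD_DELAY = 2  # wait 2 seconds between requests to the same site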

  • I tried looking at the source code of these pages, and it looks like the results are loaded later: the page keeps showing "loading". The same thing happens when we view the source; it shows a "loading-small" element and gets filled in only after some time. So my crawler would not find any data to crawl by the time it starts scraping. Any solution to that? – Aryabhatt May 31 '13 at 20:53
  • If the results are loaded later, there must be some JavaScript using AJAX to load them - you can search for the words "ajax", "post", "get" or "http://" in the JavaScript to find the URLs of the loaded data. I also use Firefox + Firebug to see which URLs the browser calls - it is even faster than searching the JavaScript. Once you have such a URL, you can test it and use it to get the data directly. – furas May 31 '13 at 21:39
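In the spirit of that pure-Python approach, once you have found such a URL in Firebug, you can test it with something like this (the endpoint and form fields here are taken from the answer below and may differ for other pages):

import urllib
import urllib2

# AJAX endpoint discovered via browser developer tools (see the answer below)
url = "http://www.peekyou.com/work/autodesk/web_results/web_tag_search_checker.php"
data = urllib.urlencode({'id': 'search_work_a10362ede5ed8ed5ff1191321978f12a', '_': ''})

# POST the form data and print the returned HTML fragment
response = urllib2.urlopen(url, data)
print response.read()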

I'll give you a starting point.

The page you're trying to crawl is loaded via AJAX, and this is a problem for Scrapy - it cannot handle content loaded dynamically through AJAX/XHR requests.

Using your browser's developer tools, you can see that there is an outgoing POST request after the page load. It goes to http://www.peekyou.com/work/autodesk/web_results/web_tag_search_checker.php.

So, simulating this in scrapy should help you to crawl necessary data:

from scrapy.http import FormRequest
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class DmozItem(Item):
    name = Field()
    link = Field()


class DmozSpider(BaseSpider):
    name = "peekyou_crawler"

    start_urls = [
        "http://www.peekyou.com/work/autodesk/page=%d" % i for i in xrange(1, 19)
    ]

    def parse(self, response):
        yield FormRequest(url="http://www.peekyou.com/work/autodesk/web_results/web_tag_search_checker.php",
                          formdata={'id': 'search_work_a10362ede5ed8ed5ff1191321978f12a',
                                    '_': ''},
                          method="POST",
                          callback=self.after_post)

    def after_post(self, response):
        hxs = HtmlXPathSelector(response)

        persons = hxs.select("//div[@class='resultCell']")

        for person in persons:
            item = DmozItem()
            item['name'] = person.select('.//h2/a/span/text()').extract()[0].strip()
            item['link'] = person.select('.//h2/a/@href').extract()[0].strip()
            yield item

It works, but it dumps only the first page. I'll leave it to you to work out how you can get the other results.
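One possible direction (just a sketch; it assumes each results page embeds its own search id in the HTML, which you would need to verify against the real page source):

import re

from scrapy.http import FormRequest

CHECKER_URL = "http://www.peekyou.com/work/autodesk/web_results/web_tag_search_checker.php"

def parse(self, response):
    # Hypothetical: pull the per-page search id out of the raw HTML.
    # The exact pattern must be checked against the actual page source.
    match = re.search(r"search_work_[0-9a-f]{32}", response.body)
    if match:
        yield FormRequest(url=CHECKER_URL,
                          formdata={'id': match.group(0), '_': ''},
                          method="POST",
                          callback=self.after_post)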

Hope that helps.
