
I want to scrape a page using Scrapy and Python.

I want to scrape the button you can see in the picture below (left picture):

http://postimg.org/image/syhauheo7/

When I click the green button saying View Code, it does three things:

  1. Redirects to another id.
  2. Opens a popup containing the code.
  3. Shows the code on the same page, as can be seen in the above picture on the right.

How can I scrape the code using the Scrapy framework?

user2373137

1 Answer


Here's your spider:

from scrapy.http import Request
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class VoucherItem(Item):
    voucher_id = Field()
    code = Field()


class CuponationSpider(BaseSpider):
    name = "cuponation"
    allowed_domains = ["cuponation.in"]
    start_urls = ["https://www.cuponation.in/babyoye-coupons"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        # Each "View Code" button is a link carrying the voucher id
        # in a data-voucher-id attribute
        crawled_items = hxs.select('//div[@class="six columns voucher-btn"]/a')
        for button in crawled_items:
            voucher_id = button.select('@data-voucher-id').extract()[0]

            item = VoucherItem()
            item['voucher_id'] = voucher_id
            # Request the clickout page that the button triggers, and pass
            # the partially filled item along in the request meta
            request = Request("https://www.cuponation.in/clickout/index/id/%s" % voucher_id,
                              callback=self.parse_code,
                              meta={'item': item})
            yield request

    def parse_code(self, response):
        hxs = HtmlXPathSelector(response)

        # Pick up the item passed from parse() and fill in the coupon code
        item = response.meta['item']
        item['code'] = hxs.select('//div[@class="code-field"]/span/text()').extract()

        return item

If you run it via:

scrapy runspider <script_name.py> --output output.json

you'll see the following in output.json:

{"voucher_id": "5735", "code": ["MUM10"]}
{"voucher_id": "3634", "code": ["Deal Activated. Enjoy Shopping"]}
{"voucher_id": "5446", "code": ["APP20"]}
{"voucher_id": "5558", "code": ["No code for this deal"]}
{"voucher_id": "1673", "code": ["Deal Activated. Enjoy Shopping"]}
{"voucher_id": "3963", "code": ["CNATION150"]}
{"voucher_id": "5515", "code": ["Deal Activated. Enjoy Shopping"]}
{"voucher_id": "4313", "code": ["Deal Activated. Enjoy Shopping"]}
{"voucher_id": "4309", "code": ["Deal Activated. Enjoy Shopping"]}
{"voucher_id": "1540", "code": ["Deal Activated. Enjoy Shopping"]}
{"voucher_id": "4310", "code": ["Deal Activated. Enjoy Shopping"]}
{"voucher_id": "1539", "code": ["Deal Activated. Enjoy Shopping"]}
{"voucher_id": "4312", "code": ["Deal Activated. Enjoy Shopping"]}
{"voucher_id": "4311", "code": ["Deal Activated. Enjoy Shopping"]}
{"voucher_id": "2785", "code": ["Deal Activated. Enjoy Shopping"]}
{"voucher_id": "3631", "code": ["Deal Activated. Enjoy Shopping"]}
{"voucher_id": "4496", "code": ["Deal Activated. Enjoy Shopping"]}
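The output above is one JSON object per line (JSON Lines style), so reading the items back in Python is straightforward. A minimal sketch of a loader (the `load_items` helper is illustrative, not part of the spider):

```python
import json

# Read back the scraped items: each line of the output file is one
# standalone JSON object, matching the output shown above.
def load_items(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```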

Happy crawling!

alecxe
  • That's exactly what I wanted :) Thanks a lot. One more thing: how can I get the URL? For example, on clicking https://www.cuponation.in/clickout/out/id/5561 I want to capture the URL of the site it gets redirected to, i.e. http://track.in.omgpm.com/?AID=369188&MID=350174&PID=9644&CID=3651763&WID=42170&UID=3356ebb745665321521c96e02fb4f684&redirect=http%3A%2F%2Fwww.bagskart.com%2Fclearance-sales%2Fbuy1-get1.html%3Futm_source%3Domg%26utm_source%3Domg in this case – user2373137 May 12 '13 at 14:53
  • Sounds like a separate question, but, take a look at `response.url` property, [redirect middleware](http://doc.scrapy.org/en/0.16/topics/downloader-middleware.html#module-scrapy.contrib.downloadermiddleware.redirect), request meta [special keys](http://doc.scrapy.org/en/latest/topics/request-response.html#request-meta-special-keys). Hope that helps. – alecxe May 12 '13 at 21:40
  • How did you figure out that https://www.cuponation.in/clickout/index/id/ is the URL to scrape for this purpose? – user2373137 May 20 '13 at 07:56
  • Just opened the browser developer tools, looked at the network tab and saw the request. – alecxe May 20 '13 at 07:57
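To round off the comment thread above: once Scrapy's redirect middleware has followed the clickout redirect, the final URL is available as `response.url` inside the callback. A minimal sketch of that idea (the `parse_code_with_url` name and the `final_url` field are illustrative, not part of the original spider):

```python
# Sketch: after Scrapy's RedirectMiddleware follows a 3xx response,
# response.url holds the URL the request was finally redirected to.
def parse_code_with_url(response):
    item = response.meta['item']
    item['final_url'] = response.url  # illustrative extra field
    return item
```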