
I am new to Scrapy and need to scrape a dataset for a data mining project. I need to scrape "http://www.moneycontrol.com/india/stockpricequote/", follow each link, and extract data. I have written a working Scrapy crawler that gets data using XPath and CSS selectors. But I came across an element on the page that uses JavaScript to populate a tabbed table. The XPath is the same for every tab, so I can't extract the data for an individual tab, in particular the stock gain percentage from each tab. This is the tabbed element, with the gain percentage in the 5th row, last column.

I can scrape data with XPath and CSS, but one part of the page gets its data from JavaScript. How can one scrape such data? I also need the data from each tab. Please tell me a way to do this, as other answers use JSON and I am not familiar with it.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class NewsItem(scrapy.Item):
    # Every key assigned in parse_news must be declared as a Field,
    # otherwise scrapy.Item raises a KeyError.
    name = scrapy.Field()
    time1 = scrapy.Field()
    news1 = scrapy.Field()


class StationDetailSpider(CrawlSpider):
    name = 'test2'
    start_urls = ["http://www.moneycontrol.com/india/stockpricequote/"]
    rules = (
        Rule(LinkExtractor(restrict_xpaths="//a[@class='bl_12']"), follow=False, callback='parse_news'),
        Rule(LinkExtractor(allow=r"/diversified/.*$"), callback='parse_news'),
    )

    def parse_news(self, response):
        item = NewsItem()
        # All three selectors point at the same cell (row 5, column 4 of the
        # table in div#disp_nse_hist), whichever tab is shown.
        NEWS1_SELECTOR = 'div#disp_nse_hist tr:nth-child(5) > td:nth-child(4)::text'
        TIME1_SELECTOR = 'div#disp_nse_hist tr:nth-child(5) > td:nth-child(4)::text'
        NAME_SELECTOR = 'div#disp_nse_hist tr:nth-child(5) > td:nth-child(4)::text'

        print("------------------------------------starting extraction------------")
        item['name'] = response.css(NAME_SELECTOR).extract_first()
        item['time1'] = response.css(TIME1_SELECTOR).extract_first()
        item['news1'] = response.css(NEWS1_SELECTOR).extract()
        return item

2 Answers


This is covered here: https://stackoverflow.com/a/8594831/7892562

What you are talking about is scraping AJAX pages, pages that can dynamically load new content without having to reload the entire page.

Follow the instructions there and you should have no problem. As an example from the page you listed, when you click a different timeframe (week, month, year, etc.), a request is made to

http://www.moneycontrol.com/stocks/company_info/get_histprices.php?ex=B&sc_id=B3M&range=7

As you can see, the URL has three query parameters passed to it. The last two, sc_id and range, indicate the company ID and the range of days for the historical pricing. Follow that link and you'll see what I'm talking about.

Given this knowledge, you should be able to figure out how to modify your spider to scrape this information.
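As a rough illustration of the point above, here is a minimal sketch of how that URL could be assembled for a given company ID and range. The parameter names (ex, sc_id, range) and the values B, B3M and 7 come from the URL quoted above; the function names `build_hist_url` and `describe`, and the example range of 30, are assumptions made up for this sketch and are not confirmed by the site.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint seen when switching tabs on the page (taken from the URL above).
HIST_ENDPOINT = "http://www.moneycontrol.com/stocks/company_info/get_histprices.php"


def build_hist_url(sc_id, days, ex="B"):
    """Return the history URL for one company ID and one range of days.

    Hypothetical helper: only the query parameter names are taken from
    the URL quoted in this answer.
    """
    query = urlencode({"ex": ex, "sc_id": sc_id, "range": days})
    return f"{HIST_ENDPOINT}?{query}"


def describe(url):
    """Split a history URL back into a dict of its query parameters."""
    pairs = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in pairs.items()}


if __name__ == "__main__":
    # range=7 is the value shown above; 30 is an assumed example value.
    for days in (7, 30):
        url = build_hist_url("B3M", days)
        print(url, describe(url))
```

Inside a spider, each URL built this way could be passed to `scrapy.Request(url, callback=...)` so that the response for every range is parsed with the usual XPath or CSS selectors.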

Joe D
  • I don't think the gain percentage is accurate. Could you tell me how you got this link? If your link is for 3M India, how do I find the sc_id of a company so that Scrapy can follow this link? – Sameer Mittal Apr 21 '17 at 09:46

Check out Splash: http://splash.readthedocs.io/en/stable/. It is a JavaScript rendering service that can be used with Scrapy and will allow you to crawl JavaScript-based web sites.
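To give an idea of how this works, Splash exposes an HTTP endpoint, `render.html`, that returns the page's HTML after its JavaScript has run. The sketch below only builds such a request URL. It assumes a Splash instance is running locally on port 8050, and the helper name `splash_render_url` and the 0.5 second wait are illustrative choices, not recommendations.

```python
from urllib.parse import urlencode


def splash_render_url(page_url, wait=0.5, splash_base="http://localhost:8050"):
    """Build a Splash render.html URL for page_url.

    Assumes Splash is reachable at splash_base (a local instance here).
    `url` and `wait` are render.html arguments; wait is the number of
    seconds Splash waits after the page loads before returning the HTML.
    """
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_base}/render.html?{query}"


if __name__ == "__main__":
    print(splash_render_url("http://www.moneycontrol.com/india/stockpricequote/"))
```

Fetching the resulting URL returns rendered HTML that can be parsed with the same selectors as any other response. The scrapy-splash package wraps the same service for use from a spider.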

You can also create your own downloader middleware and use Selenium; see "How to write customize Downloader Middleware for selenium and Scrapy?".

Hope this helps.

Adrien Blanquer