
I have to extract the data from a table on the following website:

http://www.mcxindia.com/SitePages/indexhistory.aspx

When I click on GO, a table is appended to the page dynamically. I want to export that data from the page to a CSV file (which I know how to handle), but the page source does not contain any of the data points.

I have tried looking into the JavaScript code. When I inspect the elements after the table is generated, I can see the data points, but they are not in the source. I am using mechanize in Python.

I think it is because the page is getting loaded dynamically. What should I do/use?

durron597
Aakash Anuj

5 Answers


mechanize doesn't/can't evaluate JavaScript. The easiest way I've seen to evaluate JavaScript is by using Selenium, which will open a browser on your computer and communicate with Python.

I answered a similar question here
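
Roughly, a minimal sketch of that approach; the form field names (mTbFromDate, mTbToDate, mBtnGo) are borrowed from the mechanize answer further down, so treat them as assumptions about the live page:

# Rough Selenium sketch: drive a real browser, fill the date range, click GO,
# then read the rendered DOM once the JavaScript has run.
from selenium import webdriver
import time

driver = webdriver.Firefox()  # opens a real Firefox window
driver.get('http://www.mcxindia.com/SitePages/indexhistory.aspx')

# Field/button names are assumptions taken from the mechanize answer below
driver.find_element_by_name('mTbFromDate').send_keys('08/01/2013')
driver.find_element_by_name('mTbToDate').send_keys('08/08/2013')
driver.find_element_by_name('mBtnGo').click()

time.sleep(5)  # crude wait for the postback/AJAX table to appear

html = driver.page_source  # now contains the generated table
driver.quit()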

Matthew Wesly

I agree with Matthew Wesly's comment. We can get the dynamic page using Selenium or add-ons like iMacros. They capture the dynamic page's response based on a recording, and also support JS scripting.

I think, though, that for easy extraction we can go with normal content-fetch logic using the urllib2 and urllib packages.

First, get the page's 'viewstate' parameter, i.e., get all hidden element information from the home page and pass the form information just as the JS script does.

Also pass the Content-Type header value exactly. Here the response is of the form "text/plain; charset=utf-8".
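
A rough, untested sketch of that flow with urllib2 and BeautifulSoup; the date field names are taken from the mechanize answer below, and the real form may additionally need __EVENTTARGET or the GO button's name/value pair:

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

url = 'http://www.mcxindia.com/SitePages/indexhistory.aspx'

# 1. GET the page and collect every hidden input (__VIEWSTATE, __EVENTVALIDATION, ...)
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
form_data = dict((inp['name'], inp.get('value', ''))
                 for inp in soup.findAll('input', {'type': 'hidden'})
                 if inp.get('name'))

# 2. Add the visible form fields (names assumed from the mechanize answer below)
form_data['mTbFromDate'] = '08/01/2013'
form_data['mTbToDate'] = '08/08/2013'
form_data['mBtnGo'] = 'Go'  # assumption: the submit button may need to be posted too

# 3. POST everything back; the response should contain the generated table
request = urllib2.Request(url, urllib.urlencode(form_data))
response = urllib2.urlopen(request).read()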

  • Can you please help me with the URL in the question... I am a newbie to JS – Aakash Anuj Jul 30 '13 at 08:07
  • "First, get the page's 'viewstate' parameter, i.e., get all hidden element information from the home page and pass the form information just as the JS script does." How do I do this? – Aakash Anuj Jul 30 '13 at 08:13
  • First fetch the indexhistory.aspx page using urllib2. From that, extract the hidden elements like __EVENTTARGET, __VIEWSTATE, etc., and then POST the content to http://www.mcxindia.com/SitePages/indexhistory.aspx. To identify the proper request/response keys and values, try plugins like Firebug or LiveHttpHeaders. – Deiveegaraja Andaver Jul 30 '13 at 08:24
  • How do I post the content back? And the viewstate is some arbitrary set of text. What does it mean? Is it necessary to use plugins? I just want to do it with Python – Aakash Anuj Jul 30 '13 at 08:26
  • The Firebug plugin gives us the URL request and response information. Here you are posting the content, right? Install it and try to identify how the actual page passes the content and what the action URL of the form is; we get all of that in one place. – Deiveegaraja Andaver Jul 30 '13 at 08:31
  • I am a newbie to javascript. Could you please help me a bit more with that specific website so that I get the flow of things? – Aakash Anuj Jul 30 '13 at 08:35
  • 1. Get indexhistory.aspx using req = urllib2.Request(url). 2. From that content, extract the hidden names and values. 3. Encode the form content and pass it using the POST method. Ref: http://docs.python.org/2/howto/urllib2.html#data. 4. Check whether you get a proper response; if not, try maintaining cookie information. – Deiveegaraja Andaver Jul 30 '13 at 08:53
  • Here is what I tried. Please have a look. http://stackoverflow.com/questions/17942935/post-not-receiving-correct-response – Aakash Anuj Jul 30 '13 at 09:16
  • Update **Content-Type** properly. – Deiveegaraja Andaver Jul 30 '13 at 09:39
  • Using text/plain; charset=utf-8 gives back the same page. Please help... you are the last hope :( – Aakash Anuj Jul 30 '13 at 09:44
  • Try a cookie-enabled request. Check cookie-jar-related Python libraries like cookielib. – Deiveegaraja Andaver Jul 30 '13 at 11:20

To avoid using JavaScript-aware transports, you need to:

  1. Install a web debugger in your browser.
  2. Go to that page. Press F12 to open the debugger. Reload the page.
  3. Analyze the contents of the 'Network' tab. AJAX pages usually download data as HTML fragments or as JSON. Just look into the response tab of each request made after pressing 'GO' and you will find familiar data.
  4. Now you can create a simple urllib/urllib2 downloader for that URL.
  5. Parse that data and convert it to CSV (a sketch follows below).

http://www.mcxindia.com/SitePages/indexhistory.aspx sends a POST request with the search parameters on each 'GO' and receives the HTML fragment you need to parse and convert into CSV.

So if you simulate that POST, you don't need a new browser window.
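
A hedged sketch of steps 4 and 5, assuming response already holds the HTML fragment returned by the simulated POST and the data sits in plain <tr>/<td> rows (adjust the XPath to the real markup):

import csv
import lxml.html

# Wrap the fragment in a tag so lxml gets a single root element
fragment = lxml.html.fromstring('<div>' + response + '</div>')

with open('indexhistory.csv', 'wb') as f:  # 'wb' for the csv module on Python 2
    writer = csv.writer(f)
    for row in fragment.xpath('//tr'):
        writer.writerow([cell.text_content().strip().encode('utf-8')
                         for cell in row.xpath('./td | ./th')])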

denz
  • Really an awesome answer! So how do I retrieve the HTML fragment through Python? – Aakash Anuj Jul 30 '13 at 08:16
  • use [this topic](http://stackoverflow.com/questions/3238925/python-urllib-urllib2-post) to create a correct `POST`-er. You will also need to wrap the response with a tag (any tag), and then you can parse that response with [lxml](http://lxml.de/). – denz Jul 30 '13 at 11:31

This worked!!!

import mechanize

# mechanize fills in the ASP.NET form and submits it, so the server sends back
# the page with the table already generated (no JavaScript needed).
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

url = 'http://www.mcxindia.com/SitePages/indexhistory.aspx'
br.open(url)

# Select the only form on the page and make read-only fields writable
br.select_form(nr=0)
br.set_all_readonly(False)

# Fill in the date range and press the GO button
br.form['mTbFromDate'] = '08/01/2013'
br.form['mTbToDate'] = '08/08/2013'
response = br.submit(name='mBtnGo').read()
print response
Aakash Anuj

The best thing I personally do when dealing with dynamic web pages is to use PyQt's WebKit to mimic a browser, pass the URL to it, and finally get the HTML after all the JavaScript has been rendered.

Example code:

import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage
import bs4 as bs


class Client(QWebPage):
    """Loads a URL in an off-screen WebKit page and returns once it has rendered."""

    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self.on_page_load)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # exec_() in PyQt4; 'exec' is a reserved word

    def on_page_load(self):
        self.app.quit()


url = 'http://www.mcxindia.com/SitePages/indexhistory.aspx'  # your URL
client_response = Client(url)
source = client_response.mainFrame().toHtml()  # HTML after the JavaScript has run
soup = bs.BeautifulSoup(source, "lxml")
# BeautifulSoup parsing goes here
Rajat Soni