
I need to scrape this page (which has a form): http://kllads.kar.nic.in/MLAWise_reports.aspx, with Python preferably (if not Python, then JavaScript). I was looking at libraries like RoboBrowser (which is basically Mechanize + BeautifulSoup) and (maybe) Selenium but I'm not quite sure on how to go about it. From inspecting the element, it seems to be a WebForm that I need to fill in. After filling that in, the webpage generates some data that I need to store. How should I do this?

Mathguy
  • One option is using [scrapy](http://doc.scrapy.org/). In order to make form submissions, the [request/response documentation](http://doc.scrapy.org/en/latest/topics/request-response.html) can be referenced; a rough sketch of that approach is shown after these comments. – Kadir Oct 07 '15 at 07:00
  • Please read the guide [How do I ask a good question](http://stackoverflow.com/help/how-to-ask), especially the part on Minimal, Complete, and Verifiable example (MCVE). This will help you solve problems for yourself. If you do this and are still stuck you can come back and post your MCVE, what you tried, and what the results were so we can better help you. – JeffC Oct 07 '15 at 20:01
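In case a browser-free approach is preferred, here is a rough sketch of the scrapy route suggested in the first comment. The control names (ddlconstname, ddlstatus1, BtnReport) are taken from the answer below and the option values are assumptions, so verify them against the page source before relying on this:

# Rough sketch of the scrapy approach from the comment above; field names
# and option values are assumptions taken from inspecting the page.
import scrapy
from scrapy import FormRequest

class MLAReportSpider(scrapy.Spider):
    name = 'mla_reports'
    start_urls = ['http://kllads.kar.nic.in/MLAWise_reports.aspx']

    def parse(self, response):
        # from_response copies the hidden ASP.NET fields (__VIEWSTATE etc.),
        # so only the visible dropdowns need to be filled in explicitly
        yield FormRequest.from_response(
            response,
            formdata={
                'ddlconstname': '2',             # assumed value for 'Aland (46)'
                'ddlstatus1': '2',               # assumed value for 'OnGoing'
                'BtnReport': 'Generate Report',  # the submit button
            },
            callback=self.parse_report,
        )

    def parse_report(self, response):
        # yield whatever table rows the postback produced
        for row in response.css('table tr'):
            yield {'cells': row.css('td::text').extract()}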

1 Answer


You can interact with these web forms fairly easily in Selenium. You will need to install a webdriver (chromedriver in the example below), but beyond that all you need to do is find each form field using its XPath and have Selenium select an option from the dropdown menu using the option's XPath. For the web page provided, that would look something like this:

#import functions from selenium module
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# open chrome browser using webdriver
path_to_chromedriver = '/Users/Michael/Downloads/chromedriver'
browser = webdriver.Chrome(executable_path=path_to_chromedriver)

# open web page using browser
browser.get('http://kllads.kar.nic.in/MLAWise_reports.aspx')

# wait for page to load then find 'Constituency Name' dropdown and select 'Aland (46)'
const_name = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="ddlconstname"]')))
browser.find_element_by_xpath('//*[@id="ddlconstname"]/option[2]').click()

# wait for the page to load then find 'Select Status' dropdown and select 'OnGoing'
sel_status = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="ddlstatus1"]')))
browser.find_element_by_xpath('//*[@id="ddlstatus1"]/option[2]').click()

# wait for browser to load then click 'Generate Report'
gen_report = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="BtnReport"]')))
browser.find_element_by_xpath('//*[@id="BtnReport"]').click()

Between each interaction, you are just giving the browser some time to load before attempting to click the next element. Once all the forms are filled out, the page will display the data based on the options selected and you should be able to scrape the table data (a rough sketch of that follows below). I had a few issues when attempting to load data for the first Constituency Name option, but the others seemed to work fine.
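A minimal sketch of that scraping step, continuing from the code above. The GridView-style locator is an assumption about the ASP.NET page, so inspect the generated report to find the actual id of the results table:

# wait for the report table to appear, then pull its rows
# NOTE: the 'GridView' id fragment is an assumption; check the real table id
WebDriverWait(browser, 20).until(
    EC.presence_of_element_located((By.XPATH, '//table[contains(@id, "GridView")]')))

rows = browser.find_elements_by_xpath('//table[contains(@id, "GridView")]//tr')
data = []
for row in rows:
    cells = [cell.text for cell in row.find_elements_by_tag_name('td')]
    if cells:  # skip the header row, which only has <th> cells
        data.append(cells)
print(data)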

You should also be able to loop through all the dropdown options available under each web form to display all the data.
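If you go that route, Selenium's Select helper is one way to do it. A minimal sketch, assuming the first entry in each dropdown is a placeholder and re-locating the dropdown after every postback to avoid stale element errors:

# iterate over every constituency option using the Select helper
from selenium.webdriver.support.ui import Select

num_options = len(Select(browser.find_element_by_xpath('//*[@id="ddlconstname"]')).options)

for i in range(1, num_options):  # assumes index 0 is a 'Select...' placeholder
    # re-locate the dropdown on each pass; the postback replaces the page and
    # would otherwise leave a stale element reference
    dropdown = Select(browser.find_element_by_xpath('//*[@id="ddlconstname"]'))
    dropdown.select_by_index(i)
    WebDriverWait(browser, 20).until(
        EC.element_to_be_clickable((By.XPATH, '//*[@id="BtnReport"]'))).click()
    # ...scrape the generated table here before moving on to the next option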

Hope that helps!

  • Thank you! Will Selenium open a window when you call browser.get(url) though? I'm just curious. – Mathguy Oct 08 '15 at 03:38
  • Yes it will. However, there are a few ways around this if you want to hide the browser once you have everything functioning correctly. One option is to use a headless webdriver, such as PhantomJS. I've also read about setting up a virtual display for the webdriver to run in using the virtual display module. More on that and other options can be found here: . A quick sketch of both approaches is below. – Michael Russo Oct 08 '15 at 06:37
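A minimal sketch of those two options, assuming an older Selenium release (PhantomJS support was removed in Selenium 4), that the phantomjs binary is on your PATH, and a Chrome version that supports headless mode:

# Option 1: PhantomJS, as suggested in the comment above
# (requires the phantomjs binary; deprecated/removed in newer Selenium releases)
from selenium import webdriver
browser = webdriver.PhantomJS()

# Option 2: run Chrome itself in headless mode via ChromeOptions
options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(executable_path=path_to_chromedriver, chrome_options=options)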