
I need to extract the table from this webpage: http://kff.org/womens-health-policy/state-indicator/ultrasound-requirements/#

Problem encountered: I have been using BeautifulSoup and requests to fetch the page content. The problem is that I receive the page content before the table has been generated.

So I get an empty table: <table><thead></thead><tbody></tbody></table>

My approach: I am now trying to open the URL in the browser using webbrowser.open_new_tab(url) and then read the content from the browser directly. This would give the page time to populate the table, and then I would be able to get the content from the page.

Problem: I am not sure how to fetch information from the web browser directly.

Right now I am using Mozilla Firefox on a Windows system.

The closest thing I found is this link, but it only tells you which sites are open, not their content.

Is there any way to let the table load when using urllib2, or BeautifulSoup and requests? Or is there any way to get the loaded content directly from the web page?

Thanks

raghava.nitk

2 Answers


The table isn't being filled because Python doesn't process the page it receives with urllib2. The response is just text, so there is no DOM, no JavaScript gets run, and so on.

After reading through the source, it looks like the information you're looking for can be found at http://kff.org/datacenter.json?post_id=32781 in JSON format.
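Fetching that endpoint directly could look like the sketch below. It uses only the standard library (`requests.get(url).json()` would do the same job). The helper names `datacenter_url` and `fetch_datacenter` are made up for illustration, and the layout of the returned JSON is not described in this thread, so the script only inspects the top-level structure rather than assuming any keys.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://kff.org/datacenter.json"  # endpoint quoted in the answer above


def datacenter_url(post_id):
    """Build the JSON endpoint URL for a given post_id (hypothetical helper)."""
    return BASE_URL + "?" + urlencode({"post_id": post_id})


def fetch_datacenter(post_id, timeout=10):
    """Download and decode the JSON payload; its structure is an unknown here."""
    with urlopen(datacenter_url(post_id), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    data = fetch_datacenter(32781)
    # Look at the top-level structure first, before writing extraction code.
    print(type(data))
    if isinstance(data, dict):
        print(sorted(data.keys()))
```

Once you know which keys hold the table rows, you can pull them out of `data` with ordinary dictionary and list access, with no HTML parsing needed.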

Santiclause

To add to Santiclause's answer: if you want to scrape JavaScript-populated data, you need something that executes the JavaScript.

For that you can use the selenium package with a webdriver such as Firefox or PhantomJS (which is headless) to connect to the page, execute the scripts and get the data.

Example for your case:

```python
from selenium import webdriver
driver = webdriver.Firefox()  # you can replace this with another web driver
try:
    driver.get("http://kff.org/womens-health-policy/state-indicator/ultrasound-requirements/#")
    source = driver.page_source  # page HTML as the browser currently sees it
finally:
    driver.quit()  # always quit the driver, even if loading fails
```
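Once `source` holds the rendered HTML, the table cells still have to be pulled out of it. Below is a minimal sketch using only the standard library's `html.parser` (BeautifulSoup would work just as well). `TableParser` and `extract_rows` are hypothetical names, and the sketch assumes the page contains a single plain table without nested tables; it has not been run against the actual kff.org markup.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row being read, None outside <tr>
        self._cell = None  # text fragments of the cell being read, else None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return all table rows found in html as lists of cell text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Calling `extract_rows(source)` on the unpopulated markup from the question returns an empty list, which is a quick way to check whether the table had actually loaded when `page_source` was read.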

Of course, if you can access the JSON directly, as Santiclause mentioned, you should do that. You can find it by checking the Network tab of your browser's developer tools while the page loads, which takes some playing around.

Granitosaurus
  • Hi Bernard, I am trying to run your code but it gives the error "WindowsError: [Error 87] The parameter is incorrect". I have selenium installed. Any idea how to resolve the error? – raghava.nitk Jun 20 '14 at 18:18
  • @raghava.nitk Try adding the Firefox path to your system's PATH; it looks like Selenium fails to find Firefox. If you use PhantomJS instead, you can just put PhantomJS.exe in the same folder as your script if you don't want to set up the paths, because the path to the webdriver executable defaults to the current directory. You can change that through the first argument of the webdriver constructor on line 2. – Granitosaurus Jun 21 '14 at 10:12