I'd like to get the data shown by Inspect Element using Python. I'm able to download the page source and parse it with BeautifulSoup, but now I need the text that Inspect Element shows for a webpage. I'd truly appreciate it if you could advise me on how to do it.

Edit: By Inspect Element I mean the option that Google Chrome offers on right click, which shows the code related to each element of that particular page. I'd like to extract that code, or just its text strings.

user3783999
  • You're going to have to describe what you want to do much more clearly. What is an "inspect element"? Please give an example of what you want to do. – MattDMo Jul 30 '14 at 01:14
  • It doesn't use Python, but chrome allows you to `Copy as HTML` if you right click the blue highlighted line in the editor. – Andrew Johnson Jul 30 '14 at 01:20
  • Is there any other way to do it since I'll have to do it for many pages. Also, Copy as HTML does it only for a single line as per my understanding. @AndrewJohnson – user3783999 Jul 30 '14 at 01:22
  • Can you not extract it all from the HTML you have downloaded? – Padraic Cunningham Jul 30 '14 at 01:24
  • Correct. `Copy as HTML` gives you just the selected element from one page. Below I will provide a simple web-scraper that would give you similar output through python automatically. – Andrew Johnson Jul 30 '14 at 01:25
  • Inspect Element shows the page's HTML and, like you said, you can get the HTML. When you parse it with BeautifulSoup, to get just the text in between tags, get the whole element and use `.get_text()` – Serial Jul 30 '14 at 01:25
  • The HTML source code doesn't have the code that's in 'inspect element' option. @PadraicCunningham – user3783999 Jul 30 '14 at 01:25
  • BeautifulSoup doesn't extract it. Basically I'd like to extract everything inside SVG, but the HTML doesn't have SVG in the first place. @Serial – user3783999 Jul 30 '14 at 01:27
  • That webpage uses dynamically generated html. I am not aware of any open-source tool that will run javascript from a webpage and let you automatically extract it. – Andrew Johnson Jul 30 '14 at 01:31

4 Answers

If you want to automatically fetch a web page from Python in a way that runs JavaScript, you should look into Selenium. It can automatically drive a web browser (even a headless web browser such as PhantomJS, so you don't have to have a window open).

To get the rendered HTML, you'll need to evaluate some JavaScript. Here is a simple sample; alter it to suit:

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("http://google.com")

# HTML as reported by the driver (intended as the HTML before JavaScript runs;
# some drivers return the current DOM here instead)
html1 = driver.page_source

# HTML after the on-load JavaScript has run
html2 = driver.execute_script("return document.documentElement.innerHTML;")

Note 1: If you want a specific element or elements, you actually have a couple of options -- parse the HTML in Python, or write more specific JavaScript that returns what you want.
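As a rough illustration of the first option (parsing the HTML in Python), here is a minimal sketch that pulls the text out of every `<svg>` element in an HTML string such as `html2` above. It uses the standard library's `html.parser` rather than BeautifulSoup purely to keep the example dependency-free; the `SvgTextExtractor` class, the `svg_text` helper and the `sample` markup are made up for this example.

```python
from html.parser import HTMLParser


class SvgTextExtractor(HTMLParser):
    """Collect the text found inside <svg> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # number of <svg> elements we are currently inside
        self.chunks = []   # non-empty text fragments seen inside <svg>

    def handle_starttag(self, tag, attrs):
        if tag == "svg":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "svg" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.chunks.append(text)


def svg_text(html):
    """Return the text fragments inside all <svg> elements of an HTML string."""
    parser = SvgTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Made-up markup standing in for the rendered HTML returned by the browser
sample = (
    "<html><body><p>outside</p>"
    "<svg><text x='0' y='10'>Revenue</text><g><text>2014</text></g></svg>"
    "</body></html>"
)
print(svg_text(sample))  # ['Revenue', '2014']
```

In practice you would pass the string returned by `execute_script` instead of `sample`. BeautifulSoup would do the same job with less code if it is already installed.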

Note 2: If you actually need specific information from Chrome's tools that is not just dynamically generated HTML, you'll need a way to hook into Chrome itself. No way around that.
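Newer Selenium releases warn that PhantomJS support is deprecated and point to headless Chrome or Firefox instead. The following is only a sketch of the same idea with headless Chrome, assuming a reasonably recent `selenium` package and a local Chrome installation. The function names `make_headless_chrome` and `rendered_html` are hypothetical, and pages that load data some time after the load event may still need an explicit wait.

```python
import sys


def make_headless_chrome():
    """Start Chrome without a visible window (needs selenium and Chrome installed)."""
    # Imported here so that rendered_html() below can be used with any driver object
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def rendered_html(driver, url):
    """Load url and return the DOM as it stands once the page has loaded."""
    driver.get(url)
    # outerHTML includes the <html> element itself; innerHTML would omit it
    return driver.execute_script("return document.documentElement.outerHTML;")


# Only touches a real browser when a URL is given on the command line
if __name__ == "__main__" and len(sys.argv) > 1:
    driver = make_headless_chrome()
    try:
        print(rendered_html(driver, sys.argv[1]))
    finally:
        driver.quit()  # always shut the browser down
```

Passing the driver in as an argument keeps the fetching logic independent of which browser is used, so a Firefox driver should work the same way.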

Jason S
  • Thank you very much. It worked perfectly, except that I had to add the location of phantomjs.exe in the second line, like this: `driver = webdriver.PhantomJS(executable_path=phantomjs_path)` – user3783999 Jul 30 '14 at 04:47
  • Hi, thanks for your help. I have implemented this code inside a class, but there it does not convert the JavaScript to HTML (it works OK on the command line). Could you please help me with this? – Ahmad Raza May 27 '20 at 16:00
  • Getting these when I ran the code. UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead. FileNotFoundError: [Errno 2] No such file or directory: 'phantomjs' – Senthil Vikram Vodapalli Oct 10 '21 at 08:23

I would like to add to the answer from Jason S. I wasn't able to start PhantomJS on OS X:

driver = webdriver.PhantomJS()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium/webdriver/phantomjs/webdriver.py", line 50, in __init__
    self.service.start()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/selenium/webdriver/phantomjs/service.py", line 74, in start
    raise WebDriverException("Unable to start phantomjs with ghostdriver.", e)
selenium.common.exceptions.WebDriverException: Message: Unable to start phantomjs with ghostdriver.

I resolved it, following another answer, by downloading the PhantomJS executable and passing its path:

driver = webdriver.PhantomJS("phantomjs-2.0.0-macosx/bin/phantomjs")
Jakub

For a static page, Inspect Element shows the page's HTML, which is the same as what you get by fetching the HTML with urllib.

Do something like this:

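from urllib.request import urlopen
from bs4 import BeautifulSoup as BS

URL = "http://example.com"  # placeholder: the page to fetch
tag_name = "p"              # placeholder: the tag whose text you want

# Download the raw HTML (no JavaScript is executed)
html = urlopen(URL).read()

soup = BS(html, "html.parser")

# find_all returns a list of tags, so call get_text() on each one
for tag in soup.find_all(tag_name):
    print(tag.get_text())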
Serial

BeautifulSoup can be used to parse the HTML document and extract anything you want from it. It's not designed for downloading. You can find the elements you want by their class and id.

flyingfoxlee