Python Selenium accessing HTML source

Question

How can I get the HTML source in a variable using the Selenium module with Python?

I wanted to do something like this:

from selenium import webdriver

browser = webdriver.Firefox()
browser.get("http://example.com")
if "whatever" in html_source:
    # Do something
else:
    # Do something else

How can I do this? I don't know how to access the HTML source.

Write the following line before the if condition: html_source = browser.page_source – Abdul Majeed Oct 23 '14 at 13:21

score 252 · Answer 1 · edited Apr 17 '20 at 11:44

You need to access the page_source property:

from selenium import webdriver

browser = webdriver.Firefox()
browser.get("http://example.com")

html_source = browser.page_source
if "whatever" in html_source:
    pass  # do something
else:
    pass  # do something else

answered Oct 23 '11 at 15:08 by AutomatedTester · edited Apr 17 '20 at 11:44 by Boris Verkhovskiy

9

Best answer so far! The most immediate and clear way to do this, much more compact than the other, still valid, alternative (`find_element_by_xpath("//*").get_attribute("outerHTML")`). – 5agado Mar 28 '14 at 14:16
25

What if we need to get the page source after all the JavaScript has executed? – Yogeesh Seralathan Jun 13 '14 at 05:58
6

Works only if the page has completely loaded. If the page loads indefinitely this property doesn't work. – TheRookierLearner Oct 19 '14 at 20:10
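One way to handle the JavaScript and slow-loading cases raised in these comments is to re-read page_source until the expected text shows up. The following is only a sketch under assumptions: wait_for_text_in_source, timeout and poll_interval are illustrative names, not part of Selenium, and the function relies only on the driver having a page_source attribute.

```python
import time


def wait_for_text_in_source(driver, text, timeout=10.0, poll_interval=0.5):
    """Poll driver.page_source until `text` appears or `timeout` seconds pass.

    Returns the page source that contained `text`, or None on timeout.
    `driver` is any object with a `page_source` attribute, such as a
    Selenium WebDriver instance (hypothetical helper, not a Selenium API).
    """
    deadline = time.monotonic() + timeout
    while True:
        source = driver.page_source  # re-read each time: the DOM may have changed
        if text in source:
            return source
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)
```

Selenium also ships its own WebDriverWait class in selenium.webdriver.support.ui, which is usually the better choice when you are waiting for a specific element rather than for a piece of text.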

score 18 · Answer 2 · answered May 16 '20 at 11:12

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://example.com")  # load a page first; otherwise the body is empty
html_source_code = driver.execute_script("return document.body.innerHTML;")
html_soup: BeautifulSoup = BeautifulSoup(html_source_code, 'html.parser')

Now you can use BeautifulSoup functions to extract data.
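If installing BeautifulSoup is not an option, simple extraction from the same HTML string can be done with the standard library's html.parser module instead. This is a swapped-in technique, not what the answer uses, and LinkCollector and extract_links are illustrative names only.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML string."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_source):
    """Return all link targets found in `html_source`, in document order."""
    parser = LinkCollector()
    parser.feed(html_source)
    parser.close()
    return parser.links
```

You would call it as extract_links(driver.page_source) once the page has loaded.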

score 8 · Answer 3 · edited Nov 20 '18 at 07:23

driver.page_source will help you get the page source code. You can check if the text is present in the page source or not.

from selenium import webdriver
driver = webdriver.Firefox()
driver.get("some url")
if "your text here" in driver.page_source:
    print('Found it!')
else:
    print('Did not find it.')

If you want to store the page source in a variable, add the line below after driver.get:

var_pgsource = driver.page_source

and change the if condition to:

if "your text here" in var_pgsource:
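Storing the source in a variable matters most when you check for more than one string, because every access to driver.page_source asks the browser for the source again. A minimal sketch, assuming a hypothetical helper named find_phrases:

```python
def find_phrases(driver, phrases):
    """Return {phrase: True/False} for each phrase, reading page_source once.

    `driver` is any object with a `page_source` attribute, such as a
    Selenium WebDriver instance (hypothetical helper, not a Selenium API).
    """
    source = driver.page_source  # single read, reused for every check below
    return {phrase: phrase in source for phrase in phrases}
```

For example, find_phrases(driver, ["your text here", "other text"]) returns a dictionary you can inspect without touching the browser again.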

answered Nov 19 '18 at 14:54 by Dhiraj · edited Nov 20 '18 at 07:23

1

While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value. – Nic3500 Nov 19 '18 at 17:39

score 5 · Answer 4 · answered Feb 19 '13 at 13:23

With Selenium2Library, you can use get_source():

import Selenium2Library
s = Selenium2Library.Selenium2Library()
s.open_browser("localhost:7080", "firefox")
source = s.get_source()

answered Feb 19 '13 at 13:23 by Milanka

12

Can I set a delay and get the latest source? There are dynamic contents loaded using javascript. – CodeGuru Oct 17 '13 at 23:36

score 3 · Answer 5 · edited Sep 29 '18 at 18:42

page_source gives you the HTML of the whole page. If you only need to read data from, or click, a particular element, first decide which tag or block contains it, then locate that element directly:

from selenium.webdriver.common.by import By

# find_elements_by_name() was removed in Selenium 4.3; use find_elements(By.NAME, ...)
options = driver.find_elements(By.NAME, "XXX")
for option in options:
    if option.text == "XXXXXX":
        print(option.text)
        option.click()

You can find elements by name, XPath, id, link text, and CSS selector.

answered Dec 16 '13 at 11:18 by Mahesh Reddy Atla · edited Sep 29 '18 at 18:42 by Asclepius

score 2 · Answer 6 · answered Oct 10 '19 at 17:23

You can use the WebDriver object and read the page source through its page_source property.

Try this code snippet:

from selenium import webdriver
driver = webdriver.Firefox()  # assumes geckodriver can be found on your PATH
driver.get('https://some-domain.com')
source = driver.page_source
if 'stuff' in source:
    print('found...')
else:
    print('not in source...')

answered Oct 10 '19 at 17:23 by SysMurff

How does this answer differ from https://stackoverflow.com/a/7866938/2231972 ? – Roman-Stop RU aggression in UA Oct 10 '19 at 17:29

score 1 · Answer 7 · edited Apr 20 '13 at 09:21

To answer your question about getting the URL to use for urllib, just execute this JavaScript code:

url = browser.execute_script("return window.location.href;")  # .href yields the URL as a string
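The same URL is also available without running any JavaScript, through the WebDriver's current_url property. A small sketch, where current_location is a hypothetical helper name and urlsplit is only used to show that the result is an ordinary string:

```python
from urllib.parse import urlsplit


def current_location(driver):
    """Return (full_url, hostname) for the page the driver is currently on.

    `driver` is any object with a `current_url` attribute, such as a
    Selenium WebDriver instance (hypothetical helper, not a Selenium API).
    """
    url = driver.current_url  # WebDriver property: URL of the current page
    return url, urlsplit(url).hostname
```

With a real browser you would call it as url, host = current_location(browser) after browser.get(...).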

answered Oct 25 '11 at 21:29 by Bob Evans · edited Apr 20 '13 at 09:21 by Peter Mortensen

score -7 · Answer 8 · edited Apr 20 '13 at 09:20

I'd recommend getting the source with urllib and, if you're going to parse, use something like Beautiful Soup.

from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen

with urlopen("http://example.com") as response:  # Open the URL.
    content = response.read()  # Read the source (as bytes) and save it to a variable.

answered Oct 22 '11 at 18:42 by Griffin · edited Apr 20 '13 at 09:20 by Peter Mortensen

Okay then do you know how I can get the URL within Selenium? I want to store the URL in a variable so I can access it with urllib. – user1008791 Oct 22 '11 at 19:07
@user1008791 Does it matter? You're apparently letting the user type it in anyway using raw_input, just do the same but with urllib. – Griffin Oct 22 '11 at 19:10
That was just to make an easy example, the URL will be changing a lot. – user1008791 Oct 22 '11 at 19:40
8

Selenium does many things that urllib doesn't (e.g. execution of JavaScript). – mpenkov Aug 28 '12 at 07:04
Using the urllib here is pointless, why? AutomatedTester has it correct, it is what I do for scanning through HTML source to make sure we don't push development environment code. – Dave Sep 24 '13 at 23:27
