Reading html source in soup retrieved from selenium

Question

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.maximize_window()
driver.get(url)
html_source = driver.page_source
html = BeautifulSoup(html_source)

Why are html_source and html different? What am I doing wrong here?

score 2 · Answer 1 · answered Jun 25 '15 at 20:17

driver.get is not like most other get methods: it only navigates to the page and does not return the HTML. You can then obtain the HTML from driver.page_source:

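from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.maximize_window()
driver.get(url)
# driver.get() returns nothing; the loaded page's HTML is in driver.page_source
soup = BeautifulSoup(driver.page_source)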

PascalVKooten

Thanks! I have tried what you said; check the edit above. Still the same issue. – Abhishek Bhatia Jun 25 '15 at 20:21
@AbhishekBhatia Try reopening the instance (either an interactive session or a script). This should work :) Perhaps you're looking at the `html` element and not the `soup` one. – PascalVKooten Jun 25 '15 at 20:22
Can you give an example, please? Sorry, I am new to Python. – Abhishek Bhatia Jun 25 '15 at 20:23
@AbhishekBhatia Simply restart python and rerun this code, then inspect `soup` – PascalVKooten Jun 25 '15 at 20:23
My question is with respect to this other question's answer code http://stackoverflow.com/questions/30982176/parse-the-html-code-for-a-whole-webpage-scrolled-down. – Abhishek Bhatia Jun 25 '15 at 20:23
"Tried doesn't work" is really useless to say. How is anyone supposed to help with that? If you don't describe what the issue is no one can help. – PascalVKooten Jun 25 '15 at 20:26
Sorry for not mentioning details. It did not return the full HTML I required. Thus, `driver.page_source` and `BeautifulSoup(driver.page_source)` are different. – Abhishek Bhatia Jun 25 '15 at 20:28
driver.page_source returns HTML; BeautifulSoup on driver.page_source returns a **parsed object on the html**, essentially an interface for interacting with the HTML. Questions like "what is the child element of a given tag?" or "find all tags in the HTML" can be answered with the soup object. By definition `driver.page_source != BeautifulSoup(driver.page_source)`. – PascalVKooten Jun 25 '15 at 20:29
Thanks for the info, but the issue is that I want to parse the HTML using soup elements, given that I am reading it from Selenium. When I print it using soup.prettify(), some of the HTML present in driver.page_source is removed. Does this make sense? – Abhishek Bhatia Jun 25 '15 at 20:40
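
To make the string-versus-parsed-object distinction from the comments concrete, here is a minimal sketch. It assumes a hard-coded HTML string as a stand-in for driver.page_source (the sample markup and variable names are illustrative, not from the thread) and passes "html.parser" explicitly so that no browser or extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for driver.page_source: a plain string of HTML.
page_source = "<html><body><div id='main'><p>one</p><p>two</p></div></body></html>"

# Parse the string with Python's built-in HTML parser.
soup = BeautifulSoup(page_source, "html.parser")

# The raw source is a str; the soup is a parsed tree, so the two are different objects.
source_is_str = isinstance(page_source, str)
soup_is_tree = isinstance(soup, BeautifulSoup)

# Structural questions are answered by the tree, not by the string.
main_div = soup.find("div", id="main")
child_names = [child.name for child in main_div.find_all(recursive=False)]
paragraph_texts = [p.get_text() for p in soup.find_all("p")]

# prettify() re-serializes the tree, so whitespace and quoting can differ
# from the original string even when no element has been dropped.
pretty = soup.prettify()
print(child_names, paragraph_texts)
```

Comparing prettify() output with the raw source character by character is therefore not a reliable way to tell whether elements are missing; checking for specific tags with find or find_all is usually more informative.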

score 1 · Accepted Answer · answered Jun 25 '15 at 20:10

If you call BeautifulSoup with just one argument, the document is parsed as HTML using whichever parser the library picks by default. If a tag is not valid HTML, it is corrected, so the parsed document will differ from the original source. See "Specifying the parser to use" in the Beautiful Soup documentation.
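
As a rough illustration of how the parser choice affects the repaired document, the sketch below feeds the same invalid snippet to several parsers. The snippet and the helper name parse_with are made up for this example; "lxml" and "html5lib" are optional third-party parsers and are skipped if they are not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately invalid markup: a stray closing </p> with no matching opening tag.
broken = "<a></p>"


def parse_with(parser_name):
    """Return the repaired markup as a string, or None if the parser is unavailable."""
    try:
        return str(BeautifulSoup(broken, parser_name))
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        return None


# "html.parser" ships with Python; the other two are optional installs.
results = {name: parse_with(name) for name in ("html.parser", "lxml", "html5lib")}

for name, repaired in results.items():
    print(name, "->", repaired if repaired is not None else "(not installed)")
```

Each parser repairs the invalid markup in its own way, so passing the parser name explicitly, for example BeautifulSoup(driver.page_source, "html.parser"), makes the result reproducible across machines.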

Mihai8

Thanks for info! My question is with respect to this code http://stackoverflow.com/questions/30982176/parse-the-html-code-for-a-whole-webpage-scrolled-down. How do you think I can read the entire html in soup? – Abhishek Bhatia Jun 25 '15 at 20:22
