Using WebDriver to get all text from a source page in Python

Question

I am using Selenium WebDriver (Firefox) to crawl some data from a website. I just found that opening a web page is slower than opening the source of that page. In other words, it took much longer to go to 'www.google.com' than to go to 'view-source:www.google.com'.

So I was wondering whether I can use webdriver to get all text from a source page, rather than a normal page.

I tried using driver.page_source on the source page, but it returned a mess that I don't want.

score 1 · Answer 1 · answered Aug 12 '16 at 21:29


If you only need the source, use requests. Install it with pip:

pip install requests

And use it like so:

import requests

r = requests.get("http://google.com/")
# r.status_code, r.text (decoded str), r.content (raw bytes) and r.json() can be used

For advanced usage, refer to the requests documentation.

Note: If you need to parse the HTML, use BeautifulSoup and pass it r.content.
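
As a rough sketch of the parsing step: the example below pulls the visible text out of an HTML string, which could be r.text from requests or driver.page_source from WebDriver. It deliberately uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs. The TextExtractor class, the visible_text helper and the sample markup are made up for illustration and are not part of any library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def visible_text(html):
    """Return the text nodes of an HTML string, one per line (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Stand-in for r.text or driver.page_source.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script></body></html>"
)
print(visible_text(sample))
```

With BeautifulSoup installed, BeautifulSoup(html, "html.parser").get_text() is the usual shortcut for the same job.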

answered Aug 12 '16 at 21:29

Simon Kirsten


Yes, but I have to use WebDriver because I need to manually pass the reCAPTCHA check. – Marco Aug 12 '16 at 22:55
[This](http://stackoverflow.com/questions/7861775/python-selenium-accessing-html-source) should provide you with options to get the source code. Also, to optimize load speeds you could disable images like [here](http://stackoverflow.com/questions/25214473/disable-images-in-selenium-python). – Simon Kirsten Aug 12 '16 at 23:03
@user3182260 In order to pass the captcha check, you'll probably need to render the page, not just download the source. You might try PhantomJS instead of Selenium + browser. Or, it might render faster in another browser. – jpaugh Aug 13 '16 at 00:12
