I am using Selenium WebDriver (Firefox) to crawl some data from a website. I just found that opening the web page is slower than opening the source of that page. In other words, it takes much longer to go to 'www.google.com' than to go to 'view-source:www.google.com'.

So I was wondering whether I can use WebDriver to get all the text from the source page rather than from the normal page.

I tried using driver.page_source on the source page, but it returned a mess that I don't want.

Israel Meshileya
Marco

1 Answer

If you only need the source, use requests. Install it with pip:

pip install requests

And use it like so:

import requests

# Fetch the page over HTTP directly; no browser is started.
r = requests.get("http://google.com/")
# r.content (bytes), r.text (decoded str), r.json() and r.status_code can be used

For advanced usage, refer to the requests documentation.

Note: If you need to parse the HTML, use BeautifulSoup and pass it r.content.
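If the goal is just the readable text rather than the raw markup, here is a minimal sketch of stripping tags from an HTML string. It uses only the standard library's html.parser as a stand-in for BeautifulSoup, so it is not the BeautifulSoup API. The names `html_to_text` and `_TextExtractor` are made up for illustration, and the HTML string is assumed to come from `r.text` or from `driver.page_source`.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)


def html_to_text(html):
    """Return the text of an HTML document, one text node per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser._chunks)


if __name__ == "__main__":
    sample = (
        "<html><head><style>p{color:red}</style>"
        "<script>var x = 1;</script></head>"
        "<body><h1>Title</h1><p>Hello world</p></body></html>"
    )
    print(html_to_text(sample))
```

Something like `html_to_text(r.text)` or `html_to_text(driver.page_source)` should then give text without the markup, although real pages may need more careful handling (hidden elements, whitespace inside inline tags) than this sketch provides.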

Simon Kirsten
  • Yes, but I have to use WebDriver because I need to manually pass the reCAPTCHA check. – Marco Aug 12 '16 at 22:55
  • [This](http://stackoverflow.com/questions/7861775/python-selenium-accessing-html-source) should provide you with options to get the source code. Also, to optimize load speeds you could disable images like [here](http://stackoverflow.com/questions/25214473/disable-images-in-selenium-python). – Simon Kirsten Aug 12 '16 at 23:03
  • @user3182260 In order to pass the captcha check, you'll probably need to render the page, not just download the source. You might try PhantomJS instead of Selenium + browser. Or, it might render faster in another browser. – jpaugh Aug 13 '16 at 00:12
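
The image-disabling suggestion from the comments could look roughly like the sketch below. It assumes a Selenium version that provides `webdriver.FirefoxOptions` with `set_preference`, and it relies on the Firefox preference `permissions.default.image` (where 2 is understood to mean "block images"). The names `IMAGE_BLOCKING_PREFS` and `build_firefox_options` are hypothetical helpers, not part of Selenium.

```python
# Firefox preferences to apply; 2 is assumed to mean "do not load images".
IMAGE_BLOCKING_PREFS = {"permissions.default.image": 2}


def build_firefox_options():
    """Return Firefox options with image loading disabled (hypothetical helper)."""
    # Imported lazily so the module can be loaded without Selenium installed.
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    for name, value in IMAGE_BLOCKING_PREFS.items():
        options.set_preference(name, value)
    return options


if __name__ == "__main__":
    from selenium import webdriver

    driver = webdriver.Firefox(options=build_firefox_options())
    try:
        driver.get("http://google.com/")
        html = driver.page_source  # rendered HTML of the normal page
        print(len(html))
    finally:
        driver.quit()
```

The page is still rendered, so a captcha can be solved by hand; how much time is saved by skipping images will depend on the site.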