I am having trouble getting the HTML page source of a site with Selenium when going through a proxy. Here is my code:

from selenium.webdriver.chrome.options import Options
from selenium import webdriver
import codecs
import time

proxy_username = 'myProxyUser'
proxy_password = 'myProxyPW'
port = '1080'
hostname = 'myProxyIP'

PROXY = proxy_username+":"+proxy_password+"@"+hostname+":"+port

options = Options()
options.add_argument("--headless")
options.add_argument("--kiosk")
options.add_argument('--proxy-server=%s' %PROXY)

driver = webdriver.Chrome(r'C:\Users\kingOtto\Downloads\chromedriver\chromedriver.exe', options=options)

driver.get("https://www.whatismyip.com")
time.sleep(10)
html = driver.page_source
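# Write the page source to disk; the context manager closes the file
with codecs.open('dummy.html', "w", "utf-8") as f:
    f.write(html)

# quit() ends the whole session, not just the current window
driver.quit()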

This results in very incomplete HTML, containing only the empty html, head and body tags:

html
Out[3]: '<html><head></head><body></body></html>'

Also, the dummy.html file written to disk contains nothing other than what is displayed in the line above.
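To make the symptom checkable in a script instead of by eye, a small test like the one below could flag the empty shell. This is only a sketch using the standard library; the helper name is_empty_shell is made up for illustration and is not part of Selenium.

```python
from html.parser import HTMLParser


class _ContentCounter(HTMLParser):
    """Counts tags other than html/head/body and any non-whitespace text."""

    SHELL_TAGS = {"html", "head", "body"}

    def __init__(self):
        super().__init__()
        self.content_items = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method
        if tag not in self.SHELL_TAGS:
            self.content_items += 1

    def handle_data(self, data):
        if data.strip():
            self.content_items += 1


def is_empty_shell(html):
    """Return True if the markup holds nothing beyond <html>, <head>, <body>.

    Hypothetical helper for illustration only.
    """
    parser = _ContentCounter()
    parser.feed(html)
    parser.close()
    return parser.content_items == 0
```

Calling is_empty_shell(driver.page_source) right after driver.get(...) would then tell a real page apart from the empty result shown above.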

I am lost. Here is what I tried:

  1. It does work when I run it without the options.add_argument('--proxy-server=%s' %PROXY) line, so I am sure it is the proxy. But the proxy connection itself seems to be OK: I do not get any proxy connection errors, and I do get the outer frame from the website, right? So the driver request gets through and back to me.
  2. Different URLs: not only whatismyip.com fails, but any other page as well. I tried different news outlets such as CNN, and even Google. Virtually nothing comes back from any website except the empty head and body tags. So it cannot be a JavaScript/iframe issue, right?
  3. Different wait times, up to 60 seconds (this article does not help: Make Selenium wait 10 seconds). Besides, my connection is very fast; in a browser, less than 1 second is enough.

What am I getting wrong about the connection?

KingOtto

1 Answer

driver.page_source does not always return what you expect from Selenium. It is likely NOT the full DOM. This is documented in the Selenium docs and in various SO answers, e.g.: https://stackoverflow.com/a/45247539/1387701

Selenium makes a best effort to provide the page source as it was fetched. On highly dynamic pages, what it returns can often be limited.
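One way to check whether page_source is the limiting factor is to ask the browser for the live DOM instead. The following is a minimal sketch, not a confirmed fix for the question: rendered_html is a made-up helper name, and it assumes only that driver has Selenium's execute_script method.

```python
import time


def rendered_html(driver, timeout=10.0, poll=0.5):
    """Wait until document.readyState is 'complete', then return the live DOM.

    Hypothetical helper: `driver` is any object with Selenium's
    execute_script(script) method, e.g. a webdriver.Chrome instance.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = driver.execute_script("return document.readyState")
        if state == "complete":
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"page still in state {state!r} after {timeout}s")
        time.sleep(poll)
    # Serialize the DOM as it exists now, including script-generated nodes
    return driver.execute_script("return document.documentElement.outerHTML")
```

Used as html = rendered_html(driver) in place of driver.page_source. If this also comes back as the empty html/head/body shell, the DOM itself is empty and the cause lies somewhere other than page_source.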

DMart