
I'm using a PhantomJS + Selenium + BeautifulSoup setup to print a page's source, but it only returns blank HTML on https pages. On http pages it returns the page source as expected. I've read a rake of material such as this and this, but with no result.

from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--ssl-protocol=any'])
browser.get('https://google.com')
browser.set_window_size(2000, 1500)

soup = BeautifulSoup(browser.page_source, "html.parser")

print(soup)

browser.quit()

Result

<html><head></head><body></body></html>
Complete
Iorek
  • You are aware that Google goes to great lengths to prevent their stuff from getting automated / scraped by bots who are not authorized to do so? – SiKing Jul 13 '17 at 21:58
  • I used google as an example, it could be any https page. It has nothing to do with that. – Iorek Jul 14 '17 at 00:32

1 Answer

browser = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', r'--ssl-client-certificate-file=C:\tmp\clientcert.cer', r'--ssl-client-key-file=C:\tmp\clientcert.key', '--ssl-client-key-passphrase=1111'])

I had to point PhantomJS at local copies of the SSL client certificate and key files.
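For anyone reusing this, here is a minimal sketch of building that argument list in one place. The helper name `phantomjs_ssl_args` is hypothetical (it is not part of Selenium or PhantomJS), and the paths and passphrase are the ones from this answer. The Windows paths are written as raw strings because in a normal Python string literal the `\t` in `C:\tmp` is read as a tab character.

```python
def phantomjs_ssl_args(cert_file, key_file, passphrase=None):
    """Build PhantomJS command-line flags for a client certificate.

    Hypothetical helper: it only assembles the list of flag strings.
    """
    args = [
        "--ignore-ssl-errors=true",
        "--ssl-client-certificate-file={}".format(cert_file),
        "--ssl-client-key-file={}".format(key_file),
    ]
    # Only pass a passphrase flag when the key is actually protected.
    if passphrase is not None:
        args.append("--ssl-client-key-passphrase={}".format(passphrase))
    return args


# Raw strings (r"...") keep the backslashes in the Windows paths literal.
service_args = phantomjs_ssl_args(
    r"C:\tmp\clientcert.cer",
    r"C:\tmp\clientcert.key",
    passphrase="1111",
)
print(service_args)

# With Selenium and PhantomJS installed, the list would be passed as:
# browser = webdriver.PhantomJS(service_args=service_args)
```

Printing the list before starting the browser is a quick way to confirm the paths contain no stray tab characters.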

Iorek