HTML source code of HTTPS pages different when fetched manually vs. with HTTPConnection

Asked Jun 18 '15 at 04:54

Active Jul 09 '15 at 17:36

Viewed 60 times

I'm new to Python and I've been trying to get the HTML source code of HTTPS pages. Thanks to a previous question, I am now able to extract part of the source code, but not as much as I see when I open the page manually and view its source.

Is there a simple way, using Python, to fetch the entire source that I see when I open an HTTPS page manually and view its source?

Here's the code I'm currently using:

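import http.client
from urllib.parse import urlparse

url = "https://www.google.ca/?gfe_rd=cr&ei=u6d_VbzoMaei8wfE1oHgBw&gws_rd=ssl#q=test"
p = urlparse(url)
# An https URL needs HTTPSConnection; HTTPConnection speaks plain HTTP on port 80.
conn = http.client.HTTPSConnection(p.netloc)
# Send the query string too; p.path alone drops everything after "?".
target = (p.path or "/") + ("?" + p.query if p.query else "")
conn.request("GET", target)
resp = conn.getresponse()

# Write the raw response body to disk.
with open("google_test_python.txt", "wb") as text_file:
    text_file.write(resp.read())
conn.close()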

edited May 23 '17 at 12:14

Community

asked Jun 18 '15 at 04:54

Mike Nelson

What source code are you not getting? – Anand S Kumar Jun 18 '15 at 04:56
If I open the Google test page (URL in the code above) and press Ctrl+U, I get a file more than twice as big as the one I get when I run the code above in Python. – Mike Nelson Jun 18 '15 at 05:18
Just a guess, but could it have something to do with user-agent strings or the request headers? I assume HTTPConnection sends different headers than your browser does. – alexwlchan Jun 18 '15 at 09:26
Sorry, I'm not quite sure I follow. If that's the case, what could I do to solve this issue? Thanks. – Mike Nelson Jun 18 '15 at 21:32
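
To make the header suggestion above concrete, here is a minimal sketch of sending browser-like request headers with http.client. The header values, the 10-second timeout, and the helper names build_target and fetch are assumptions chosen for illustration; they are not from the original post, and there is no guarantee the server will return the same page a browser receives.

```python
import http.client
from urllib.parse import urlsplit

# Hypothetical browser-like headers; the values are examples only.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/43.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-CA,en;q=0.8",
}


def build_target(url):
    """Return (host, request_target) for url, keeping the query string."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # The fragment (the part after "#") is never sent to the server.
    return parts.netloc, target


def fetch(url, headers=BROWSER_HEADERS):
    """Fetch url over HTTPS and return (status, body_bytes)."""
    host, target = build_target(url)
    conn = http.client.HTTPSConnection(host, timeout=10)
    try:
        conn.request("GET", target, headers=headers)
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()
```

Calling fetch(url) with the URL from the question returns the status code and the raw body, which can then be written to a file in "wb" mode. Two caveats: http.client does not follow redirects, so a 3xx status means the Location header has to be requested separately, and the "#q=test" fragment is only ever seen by the browser, so anything the page builds from it with JavaScript will not appear in the fetched bytes.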

0 Answers