I'm new to Python and have been trying to get the source code of a web page. I've tried several methods on both Python 2 and Python 3 (here's one):

import urllib

url = "https://www.google.ca/?gfe_rd=cr&ei=u6d_VbzoMaei8wfE1oHgBw&gws_rd=ssl#q=test"
f = urllib.urlopen(url)
source = f.read()
print source

but I keep getting the following error:

Traceback (most recent call last):
  File "C:\Python34\openpage.py", line 4, in <module>
    f = urllib.urlopen(url)
  File "C:\Python27\lib\urllib.py", line 87, in urlopen
    return opener.open(url)
  File "C:\Python27\lib\urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "C:\Python27\lib\urllib.py", line 443, in open_https
    h.endheaders(data)
  File "C:\Python27\lib\httplib.py", line 1049, in endheaders
    self._send_output(message_body)
  File "C:\Python27\lib\httplib.py", line 893, in _send_output
    self.send(msg)
  File "C:\Python27\lib\httplib.py", line 855, in send
    self.connect()
  File "C:\Python27\lib\httplib.py", line 1274, in connect
    server_hostname=server_hostname)
  File "C:\Python27\lib\ssl.py", line 352, in wrap_socket
    _context=self)
  File "C:\Python27\lib\ssl.py", line 579, in __init__
    self.do_handshake()
  File "C:\Python27\lib\ssl.py", line 808, in do_handshake
    self._sslobj.do_handshake()
IOError: [Errno socket error] [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)

The last line suggests that the error comes from the secure search, but I can't seem to find a way around it.

I have looked at this post, but still no success.

Mike Nelson

2 Answers

You are using https, which is a secure protocol, and the certificate check is failing. The error says

SSL: CERTIFICATE_VERIFY_FAILED

Try http, or use the ssl module (https://docs.python.org/2/library/ssl.html):

url = "http://www.google.ca"
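
If you would rather stay on https, one option is to hand urlopen an explicit SSL context. The following is a minimal Python 3 sketch, not a definitive fix: make_context and fetch are hypothetical helper names, and the cafile argument is an assumption for the case where the machine's default certificate store lacks the CA that signed the site's certificate.

```python
import ssl
import urllib.request


def make_context(cafile=None):
    """Return an SSL context that verifies certificates and host names.

    cafile is an optional path to a PEM bundle of trusted CA certificates
    (hypothetical; only needed if the system store is incomplete).
    """
    return ssl.create_default_context(cafile=cafile)


def fetch(url, context=None, timeout=10):
    """Download url over HTTPS and return the raw response body as bytes."""
    if context is None:
        context = make_context()
    with urllib.request.urlopen(url, context=context, timeout=timeout) as f:
        return f.read()
```

Calling fetch("https://www.google.ca/") should return the page bytes when verification succeeds. If it still raises CERTIFICATE_VERIFY_FAILED, the set of trusted CA certificates on the machine is the likely thing to check.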
Alex Ivanov

Here's some sample code you can try on Python 3, using urlparse:

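import http.client
from urllib.parse import urlparse

url = "https://www.google.ca/?gfe_rd=cr&ei=u6d_VbzoMaei8wfE1oHgBw&gws_rd=ssl#q=test"
p = urlparse(url)  # split the URL into scheme, netloc, path, query and fragment
# HTTPConnection speaks plain HTTP on port 80, so no certificate check takes place
conn = http.client.HTTPConnection(p.netloc)
# only the path ('/') is requested; the query string and fragment are not sent
conn.request('GET', p.path)
resp = conn.getresponse()
print('resp= {}'.format(resp.read()))  # resp.read() returns bytes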

What you get back depends on the parameters you pass to conn.request(), though. You could try other request methods, such as HEAD, and the response will change accordingly.
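
To illustrate how the arguments to conn.request() shape the result, here is a small sketch that builds the request path from the parsed URL. The helper name request_target is made up for this example, and it assumes you want the query string sent as well. The part after # is a fragment, which is handled on the client side and is not sent to the server.

```python
from urllib.parse import urlparse


def request_target(url):
    """Return the path plus query string to pass to conn.request()."""
    p = urlparse(url)
    target = p.path or "/"          # an empty path means the site root
    if p.query:
        target += "?" + p.query     # keep the query; the fragment is dropped
    return target


url = "https://www.google.ca/?gfe_rd=cr&ei=u6d_VbzoMaei8wfE1oHgBw&gws_rd=ssl#q=test"
print(request_target(url))
```

You could then call conn.request('GET', request_target(url)) instead of passing p.path alone.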

If you want to test whether your request worked or not, you can always try:

print(resp.status)

In this case, it gives 200. The list of status codes is available here
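
As a rough sketch of acting on the status code, the standard library's http.client.responses dictionary maps each code to its reason phrase. The helper describe_status is a hypothetical name, and treating only 2xx codes as success is an assumption you may want to adjust.

```python
import http.client


def describe_status(code):
    """Return (ok, text): ok is True for 2xx codes, text is e.g. '200 OK'."""
    phrase = http.client.responses.get(code, "Unknown")
    return 200 <= code < 300, "{} {}".format(code, phrase)


for code in (200, 301, 404):
    print(describe_status(code))
```

With the response object from the sample code, this would be called as describe_status(resp.status).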

Some other examples can be found as well.

  • Thanks for this. However, I seem to be getting much less information printed out than I do when I save the html file as a webpage and open it in Notepad. Any idea why it isn't complete? – Mike Nelson Jun 16 '15 at 07:19
  • I figured my last comment could be a different problem and wanted to accept your answer, so I asked a separate [question](http://stackoverflow.com/questions/30906705/html-source-code-of-https-pages-different-when-fetched-manually-vs-with-httpcon?noredirect=1#comment49851502_30906705). – Mike Nelson Jun 18 '15 at 05:35