
I am reading the book "The Ultimate Guide to Web Crawling".

The code used to run the first HTTP GET request is the following:

import requests 
url = "https://scrapethissite.com/pages/simple/" 
r = requests.get(url) 
print("We got a {} response code from {}".format(r.status_code, url))

I got the error message:

HTTPSConnectionPool(host='scrapethissite.com', port=443): Max retries exceeded with url: /pages/simple/ (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1123)')))

I understand that my request doesn't go to the right port. Is this linked to the fact that the website uses HTTPS rather than HTTP? I am not sure, but it seems to be part of the problem.

I am using Python 3.8 on PyCharm. My SSL version is:

OpenSSL 1.1.1g 21 Apr 2020
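
In case it matters, both versions can be confirmed from within the interpreter itself; a minimal check (not from the book):

import ssl
import sys

print(sys.version)           # Python interpreter version
print(ssl.OPENSSL_VERSION)   # OpenSSL build that Python's ssl module is linked against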

I am a beginner in web crawling, which is why I chose an alternative way to run my HTTP GET request, one that lets me select the port and protocol explicitly (source: https://pythonprogramming.net/python-sockets/):

import socket
import ssl

# A TLS context (PROTOCOL_TLSv1 pins TLS 1.0) that verifies the server
# certificate against the system's default CA bundle.
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
context.verify_mode = ssl.CERT_REQUIRED
context.check_hostname = True
context.load_default_certs()

server = 'scrapethissite.com'
port = 443
server_ip = socket.gethostbyname(server)

# Wrap a plain TCP socket so the connection is encrypted with TLS.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s = context.wrap_socket(s, server_hostname=server)

# A minimal HTTP/1.1 request (the spec calls for \r\n line endings; this
# server also accepts \n).
request = "GET / HTTP/1.1\nHost: "+server+"\n\n"

s.connect((server, port))
s.send(request.encode())
result = s.recv(4096)

# Keep reading 4096-byte chunks until recv() returns no data.
while (len(result) > 0):
    print(result)
    result = s.recv(4096)

I got an HTTP 200 OK status response, so it is working. This is the output in the PyCharm terminal:

b'HTTP/1.1 200 OK\r\nDate: Tue, 12 Jan 2021 14:59:35 GMT\r\nContent-Type: text/html; charset=utf-8\r\nTransfer-Encoding: chunked\r\nConnection: keep-alive\r\nSet-Cookie: __cfduid=d205b0b8e8ce061174412767189bf10b41610463575; expires=Thu, 11-Feb-21 14:59:35 GMT; path=/; domain=.scrapethissite.com; HttpOnly; SameSite=Lax\r\nCF-Cache-Status: DYNAMIC\r\ncf-request-id: 0798b515a60000ea04f707d000000001\r\nExpect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"\r\nReport-To: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report?s=%2FROG7Z2JWZJBMeVNn1IgnJh2TZsqJCi9TJOL3zau98btlLo1nPg4WhGlmOz2SZ6PRep6%2BKZfv0M81fqKOw1l6%2BRbc5M9dErdtyeTsei9Ee%2F2jc0%3D"}],"group":"cf-nel","max_age":604800}\r\nNEL: {"report_to":"cf-nel","max_age":604800}\r\nServer: cloudflare\r\nCF-RAY: 6107be029e27ea04-IAD\r\n\r\n1fb5\r\n<!doctype html>\n\n \n \n

(… the rest of the chunked HTML body of the site's homepage follows, truncated here.)

The only problem is that I want to scrape this page:

https://scrapethissite.com/pages/simple/

and not:

https://scrapethissite.com

When I replace

server = 'scrapethissite.com'

with:

server = 'scrapethissite.com/pages/simple/'

in the previous code, I get this new error message:

socket.gaierror: [Errno 11001] getaddrinfo failed

My understanding is that the problem is linked to the proxy. Knowing that the problem may involve the port, the socket, or a proxy is informative, but I am not sure what to fix or how, since the code works fine for one URL but not the other.
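
For reference, splitting the target URL with the standard library shows which part is a hostname (what socket.gethostbyname and the Host header expect) and which part is a path (what belongs on the GET line). This is just a quick sketch of what I mean, not code from the book:

from urllib.parse import urlsplit

url = "https://scrapethissite.com/pages/simple/"
parts = urlsplit(url)

print(parts.hostname)     # 'scrapethissite.com' -> use with gethostbyname and the Host header
print(parts.path)         # '/pages/simple/'     -> belongs in the GET request line
print(parts.port or 443)  # no explicit port, so 443 is implied by the https scheme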

Any help is highly appreciated. Thank you!


Edit: Following OneCricketeer's answer, the code is now:

import socket
import ssl

context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
context.verify_mode = ssl.CERT_REQUIRED
context.check_hostname = True
context.load_default_certs()

server = 'scrapethissite.com'
port = 443
server_ip = socket.gethostbyname(server)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s = context.wrap_socket(s, server_hostname=server)

# The server stays the bare domain name; only the request path changes.
request = "GET /pages/simple HTTP/1.1\nHost: "+server+"\n\n"

s.connect((server, port))
s.send(request.encode())
result = s.recv(4096)

while (len(result) > 0):
    print(result)
    result = s.recv(4096)

I get an HTTP 301 MOVED PERMANENTLY status response:

b'HTTP/1.1 301 MOVED PERMANENTLY\r\nDate: Tue, 12 Jan 2021 15:34:15 GMT\r\nContent-Type: text/html; charset=utf-8\r\nTransfer-Encoding: chunked\r\nConnection: keep-alive\r\nSet-Cookie: __cfduid=d6e32136f617c0b90e7f92a3e391c159f1610465655; expires=Thu, 11-Feb-21 15:34:15 GMT; path=/; domain=.scrapethissite.com; HttpOnly; SameSite=Lax\r\nLocation: https://scrapethissite.com/pages/simple/\r\nCF-Cache-Status: DYNAMIC\r\ncf-request-id: 0798d4d0d700002550fc1c3000000001\r\nExpect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"\r\nReport-To: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report?s=2moOTvTDPvS65D6d0LvsiZTLDqYcv8OFZvtunIQDq6H%2FKLucm1LOOlMABcnCUjUO9fK4bwd%2BVDiescQ0NyHbu3DxhTCkOUHTvMcilkM%2BdcZnz3A%3D"}],"group":"cf-nel","max_age":604800}\r\nNEL: {"report_to":"cf-nel","max_age":604800}\r\nServer: cloudflare\r\nCF-RAY: 6107f0c7bb432550-IAD\r\n\r\n11f\r\n\nRedirecting...\n

Redirecting...\n\nYou should be redirected automatically to target URL: https://scrapethissite.com/pages/simple/. If not click the link.\r\n' b'0\r\n\r\n'

Is there something I missed?

windyboo

1 Answer


I am using Python 3.8 on PyCharm

Based on your print usage, you are actually using Python 2...

In any case, this might work for the requests approach:

import requests 
url = "https://scrapethissite.com/pages/simple/" 
r = requests.get(url, verify=False) 
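
Note that verify=False disables certificate verification entirely, so urllib3 will print an InsecureRequestWarning on every request. For quick tests against this sandbox that warning can be silenced, for example (a small sketch, using the urllib3 package that requests depends on):

import urllib3

# Suppress the InsecureRequestWarning that verify=False triggers.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)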

If you want to use the socket method, you would change GET / to GET /pages/simple and keep server as just the domain name.

I understand that my request doesn't go the right port.

443 is the correct HTTPS port. The error is stating that the SSL version is incorrect.
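
In practice, WRONG_VERSION_NUMBER usually means something answered with plain HTTP on port 443, for example an intercepting proxy between you and the site. One way to check whether a proxy is configured through environment variables, and to bypass it if so (a quick sketch, not specific to your machine):

import os
import requests

# requests honours these environment variables by default.
for var in ("HTTP_PROXY", "HTTPS_PROXY", "http_proxy", "https_proxy"):
    print(var, "=", os.environ.get(var))

# If a broken proxy turns out to be the culprit, a Session that ignores the
# environment can be used instead.
session = requests.Session()
session.trust_env = False
r = session.get("https://scrapethissite.com/pages/simple/")
print(r.status_code)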

OneCricketeer
  • @OneCricketeer Thank you for your reply! Now I get an HTTP 301 MOVED PERMANENTLY status response. What is wrong? – windyboo Jan 12 '21 at 15:45
  • You'll need to follow redirects... Not sure how to do that with sockets, but you can read the full response, and it should have something like a Location header. – OneCricketeer Jan 12 '21 at 16:32
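
For completeness, a minimal sketch of following that Location header by hand with the socket approach (the parsing here is deliberately simplified; with requests, redirects are followed automatically):

import socket
import ssl
from urllib.parse import urlsplit

def fetch(url):
    # Issue a single GET over TLS and return the raw response bytes.
    parts = urlsplit(url)
    host, path = parts.hostname, parts.path or "/"
    context = ssl.create_default_context()
    with socket.create_connection((host, 443)) as raw:
        with context.wrap_socket(raw, server_hostname=host) as s:
            request = "GET {} HTTP/1.1\r\nHost: {}\r\nConnection: close\r\n\r\n".format(path, host)
            s.sendall(request.encode())
            chunks = []
            while True:
                data = s.recv(4096)
                if not data:
                    break
                chunks.append(data)
    return b"".join(chunks)

response = fetch("https://scrapethissite.com/pages/simple")
headers = response.split(b"\r\n\r\n", 1)[0].decode("latin-1")

# On a 301/302, pull the target out of the Location header and request it instead.
if headers.startswith("HTTP/1.1 301") or headers.startswith("HTTP/1.1 302"):
    for line in headers.split("\r\n"):
        if line.lower().startswith("location:"):
            response = fetch(line.split(":", 1)[1].strip())
            break

print(response[:200])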