
I'm trying to use httplib to check whether each URL in a list of 30k+ websites still works. Each URL is read from a .csv file into a matrix, and a for-loop then goes through every URL in that matrix. For each one (this is where my problem is), I call a function, runInternet(url), which takes the URL string and returns True if the URL works and False if it doesn't. I've used this as my baseline, and have also looked into this. I've tried both, but I don't quite understand the latter, and neither works...

def runInternet(url):
    try:
        page = httplib.HTTPConnection(url)
        page.connect()
    except httplib.HTTPException as e:
        return False
    return True

However, afterwards, all the links are stated as broken! I randomly chose a few that worked, and they work when I input them into my browser...so what's happening? I've narrowed down the problem spot to this line: page = httplib.HTTPConnection(url)

Edit: I tried passing 'www.google.com' in place of the url, and the program works. When I print e, it says nonnumeric port...

JPLim
  • Don't use low(ish)-level HTTP interfaces like the `httplib` - it won't handle a lot of things for you, including modifiers, redirects, cookies... Use at least `urllib/urllib2` or, even better, the `requests` module. If you still insist, at least provide a sample of your data and how are you calling your function as, in theory, it should work for simple and direct URLs just fine. – zwer Jul 17 '17 at 20:02
  • I did initially try using urllib2, but it ended up being way too slow, and crashed on me after it hit the ~2100 mark. But I'll look into requests to see if it works better! – JPLim Jul 17 '17 at 21:02

1 Answer


You could troubleshoot this by letting the HTTPException propagate instead of catching it. The specific exception type and its message would likely help you understand what is wrong.
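If you would rather keep the loop running over all 30k URLs than let it crash on the first error, a minimal sketch along these lines reports the exception type and message instead of swallowing them. The helper name describe_failure is hypothetical, and the import fallback is only there so the snippet also runs on Python 3, where httplib is called http.client. Note that the constructor does not open a connection, so this only surfaces errors raised while parsing the argument.

```python
try:
    import httplib  # Python 2, as used in the question
except ImportError:
    import http.client as httplib  # Python 3 name for the same module


def describe_failure(host):
    """Hypothetical helper: return 'ExceptionType: message' if the
    HTTPConnection cannot even be constructed, otherwise None."""
    try:
        # No network traffic happens here; the constructor only parses
        # its argument into a host and an optional port.
        httplib.HTTPConnection(host)
    except httplib.HTTPException as e:
        return '%s: %s' % (type(e).__name__, e)
    return None


# A full URL is rejected, a bare host name is accepted.
print(describe_failure('https://www.google.com/'))
print(describe_failure('www.google.com'))
```

Logging this string next to each URL that gets marked as broken should make it clear whether the site is really down or the argument was simply malformed.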

I suspect though that the problem is this line:

page = httplib.HTTPConnection(url)

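The first argument to the constructor is not a URL. Instead, it's a host name, optionally followed by a colon and a port number. When a full URL is passed, everything after the last colon is treated as the port, so this code sample passing a URL to the constructor fails: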

page = httplib.HTTPConnection('https://www.google.com/')
page.connect()

httplib.InvalidURL: nonnumeric port: '//www.google.com/'

If I instead pass the host name to the constructor, and then the URL path to the request method, it works:

conn = httplib.HTTPConnection('www.google.com')
conn.request('GET', '/')
resp = conn.getresponse()
print resp.status, resp.reason

200 OK
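Since the .csv file presumably holds full URLs rather than bare host names, each one has to be split into a host and a path before it reaches httplib. The following is a rough sketch of that, not a drop-in replacement: split_url and url_works are hypothetical names, the 10-second timeout and the "status below 400 counts as working" rule are assumptions, and redirects are not followed. The import fallback lets it run on Python 3 as well.

```python
import socket

try:
    import httplib  # Python 2
    from urlparse import urlsplit
except ImportError:
    import http.client as httplib  # Python 3
    from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into (scheme, host, path) for use with httplib."""
    if '://' not in url:
        # Assumption: entries without a scheme are plain http sites.
        url = 'http://' + url
    parts = urlsplit(url)
    path = parts.path or '/'
    if parts.query:
        path += '?' + parts.query
    return parts.scheme, parts.netloc, path


def url_works(url, timeout=10):
    """Return True if a HEAD request to the URL gets a status below 400."""
    scheme, host, path = split_url(url)
    if scheme == 'https':
        conn_class = httplib.HTTPSConnection
    else:
        conn_class = httplib.HTTPConnection
    try:
        conn = conn_class(host, timeout=timeout)
        try:
            conn.request('HEAD', path)
            status = conn.getresponse().status
        finally:
            conn.close()
    except (httplib.HTTPException, socket.error):
        # Covers malformed hosts, DNS failures, refused connections
        # and timeouts.
        return False
    return status < 400
```

Calling url_works('https://www.google.com/') would then take the place of runInternet(url) in the loop. A HEAD request is used here to avoid downloading page bodies; some servers reject HEAD, so switching to 'GET' may be needed for those.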

For reference, here is the relevant abridged documentation of HTTPConnection:

class HTTPConnection
 |  Methods defined here:
 |  
 |  __init__(self, host, port=None, strict=None, timeout=<object object>, source_address=None)
 ...
 |  request(self, method, url, body=None, headers={})
 |      Send a complete request to the server.
Chris Nauroth