I would like to use Python to check whether a file/webpage exists based on its response code and act accordingly. However, I am required to use HTTPS and to provide username and password credentials. I couldn't get it working with curl (it doesn't like HTTPS), but had success with wget (using --spider, --user and --password). I suppose I could call wget from the script via os.system, but it prints a lot of output that would be very tricky to parse, and if the URI does not exist (i.e. a 404), I think it gets stuck at "awaiting response..".

I've had a look at urllib2 around the web and have seen what other people do with it, but I'm not sure whether it addresses my situation, and the solutions are always very convoluted (such as Python urllib2, basic HTTP authentication, and tr.im). Anyway, some guidance on the easiest avenue to pursue in Python would be appreciated.

edit: using the os.system method (and providing wget with "-q") seems to return a different number if the URI exists or not, so that gives me something to work with for now.
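
A rough sketch of that exit-code approach, using subprocess instead of os.system so the status comes back as a plain integer and nothing needs parsing. The function names are made up for illustration, and the --tries and --timeout values are assumptions added so a missing page cannot hang the script. Per the wget manual, exit status 0 means no problems and 8 means the server issued an error response (such as a 404).

```python
import subprocess


def wget_spider_command(url, user, password):
    """Build the wget call: quiet, no download, with credentials."""
    return [
        "wget", "-q", "--spider",
        "--tries=1", "--timeout=10",  # assumed values; give up quickly
        "--user=" + user,
        "--password=" + password,
        url,
    ]


def exists_via_wget(url, user, password):
    """Return True when wget exits with status 0 (the request succeeded)."""
    # Passing a list avoids shell quoting problems in the URL or password.
    return subprocess.call(wget_spider_command(url, user, password)) == 0
```

Note that a password passed on the command line is visible to other users in the process list, so this is only reasonable on a machine you trust.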

Peter
  • Are the username and password to be provided via [Basic access authentication](http://en.wikipedia.org/wiki/Basic_access_authentication), or via some custom login scheme, where the credentials are to be included in the POST? – Jonathon Reinhart Apr 28 '14 at 03:28
  • Just basic auth, the same way curl and wget provide username and password. – Peter Apr 28 '14 at 03:35

3 Answers

5

You can make a HEAD request using the Python requests library.

import requests
r = requests.head('http://google.com/sjklfsjd', allow_redirects=True, auth=('user', 'pass'))
assert r.status_code != 404

If the request fails with a ConnectionError, the host could not be reached (for example, the domain does not exist). If the host exists but the page does not, the request itself succeeds and the status code is 404.
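
To make the distinction between those outcomes concrete, here is a minimal sketch that wraps the same requests.head call over HTTPS. The helper names, the outcome labels, and the 10-second timeout are assumptions made for illustration. The 401 case covers credentials the server rejects.

```python
def describe_status(code):
    """Map an HTTP status code to a coarse outcome label."""
    if code == 401:
        return "unauthorized"
    if code == 404:
        return "missing"
    if 200 <= code < 400:
        return "exists"
    return "error"


def check_page(url, user, password, timeout=10):
    """Send a HEAD request with Basic auth and report what happened."""
    # Imported here so describe_status works without the third-party package.
    import requests

    try:
        r = requests.head(url, allow_redirects=True,
                          auth=(user, password), timeout=timeout)
    except (requests.exceptions.ConnectionError,
            requests.exceptions.Timeout):
        return "unreachable"
    return describe_status(r.status_code)
```

A call would look like check_page('https://example.com/report.pdf', 'user', 'pass'), where the URL is a placeholder.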

Requests has a pretty nice interface so I recommend checking it out. You'll probably like it a lot as it is very intuitive and powerful (while being lightweight).

dominik
  • Not so sure if this would work, as the web server would return a 401 unauthorized if you are unable to supply username and password. – Peter Apr 28 '14 at 03:41
  • I edited the code to take user and password. The exact way to pass this data depends on the server, though. – dominik Apr 29 '14 at 04:02
  • Thanks for mentioning this great library. I've been working with urllib2 for a year or 2, but I switched as quickly as I could :') – ToonAlfrink Jun 13 '14 at 16:04
1

urllib2 is the standard-library way to open a web page:

urllib2.urlopen('http://google.com')

For added functionality, you'll need an opener with handlers. I reckon you'll only need the HTTPS handler, because you're barely extracting any info:

opener = urllib2.build_opener(
    urllib2.HTTPSHandler())
opener.open('https://google.com')

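Pass data and the request automatically becomes a POST. Note that this sends the credentials as form fields in the request body, which is not the same thing as HTTP Basic authentication: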

opener.open('https://google.com',data="username=bla&password=da")

The response object you get back has a code attribute (also available via getcode()) holding the HTTP status code.

That's the basic gist of it. Add as many handlers as you need; I believe extra ones can't hurt. Source: https://docs.python.org/2.7/library/urllib2.html#httpbasicauthhandler-objects
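
Since the question asks for Basic auth, here is a minimal sketch of the handler the linked documentation describes. It assumes Python 3, where urllib2 became urllib.request and urllib.error. The function names and the timeout are illustrative choices, not part of the library.

```python
import urllib.error
import urllib.request


def build_basic_auth_opener(url, user, password):
    """Return an opener that answers Basic auth challenges for this URL."""
    manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # realm=None: use these credentials whatever realm the server names.
    manager.add_password(None, url, user, password)
    handler = urllib.request.HTTPBasicAuthHandler(manager)
    return urllib.request.build_opener(handler)


def status_code(opener, url, timeout=10):
    """Send a HEAD request and return the HTTP status, including 4xx/5xx."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.getcode()
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 401 (bad credentials) or 404 (no such page)
```

The handler only sends the credentials after the server replies with a 401 challenge, so each check may cost two round trips.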

ToonAlfrink
0

You should use urllib2 to check that:

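import urllib2, getpass

url = raw_input('Enter the url to search: ')
username = raw_input('Enter your username: ')
password = getpass.getpass('Enter your password: ')

# Prepend a scheme only when the URL has neither http:// nor https://
if not url.startswith(('http://', 'https://')):
    url = 'http://' + url

def check(url):
    # urlopen raises HTTPError for 4xx/5xx responses and URLError when the
    # host cannot be reached; treat both as "does not exist".
    try:
        urllib2.urlopen(url)
        return True
    except urllib2.URLError:  # HTTPError is a subclass of URLError
        return False

if check(url):
    print 'The webpage exists!'
else:
    print 'The webpage does not exist!'

# Passing data makes this a POST with the credentials as form fields.
# This is not HTTP Basic authentication.
opener = urllib2.build_opener(urllib2.HTTPSHandler())
opener.open(url, data="username=%s&password=%s" % (username, password))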

This runs as:

bash-3.2$ python url.py
Enter the url to search: gmail.com
Enter your username: aj8uppal
Enter your password: 
The webpage exists!
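
For servers that expect Basic auth rather than form fields, the credentials travel in an Authorization header instead of the request body. A minimal sketch of building that header by hand, assuming Python 3 (urllib.request in place of urllib2); the function names are hypothetical:

```python
import base64
import urllib.request


def basic_auth_header(user, password):
    """Return the Authorization header value for HTTP Basic auth."""
    token = base64.b64encode((user + ":" + password).encode("utf-8"))
    return "Basic " + token.decode("ascii")


def authed_head_request(url, user, password):
    """Build (but do not send) a HEAD request carrying the credentials."""
    request = urllib.request.Request(url, method="HEAD")
    request.add_header("Authorization", basic_auth_header(user, password))
    return request
```

The resulting request can be passed to urllib.request.urlopen. Because the header is only base64-encoded and not encrypted, this should be used over HTTPS.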
A.J. Uppal