
I have a Python web scraping script, and I need to validate the URL by testing connectivity to the website, i.e. checking that it is an existing website. Can anyone help me implement this in my code?

Here's my code:

import sys, urllib

while True:
    try:
        url= raw_input('Please input address: ')
        webpage=urllib.urlopen(url)
        print 'Web address is valid'
        break
    except:
        print 'No input or wrong URL format. Usage: http://www.domainname.com/'
        print 'Please try again'
def wget(webpage):
        print '[*] Fetching webpage...\n'
        page = webpage.read()
        return page      
def main():
    sys.argv.append(webpage)
    if len(sys.argv) != 2:
        print '[-] Usage: webpage_get URL'
        return
    print wget(sys.argv[1])

if __name__ == '__main__':
    main()

EDIT: I have some code that I took from another Stack Overflow post. It works, and I just want to integrate it into my code. I tried to integrate it myself but got errors. Can anyone help me do this? Here's the code:

from urllib2 import Request, urlopen, URLError
req = Request('http://jfvbhsjdfvbs.com')
try:
    response = urlopen(req)
except URLError, e:
    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
else:
    print 'URL is good!'
user3034404
  • Looks nice, only that your `while True` is executed before you call main. – Hyperboreus Dec 10 '13 at 17:22
  • I'd rather check the response code, look at [this](http://stackoverflow.com/questions/1140661/python-get-http-response-code-from-a-url) post – Jan Vorcak Dec 10 '13 at 17:25
  • Yes, that's what I need, but I don't know how to implement it in my code. So I'm asking whether anyone can help me do this. – user3034404 Dec 10 '13 at 17:29
  • @Hyperboreus what do you mean? – user3034404 Dec 10 '13 at 17:35
  • @user3034404 A Python script is executed top to bottom. In your case: first your `while` with its suite, then the two `def`s (adding the functions to the scope), and then the condition which may invoke `main`. In this order, your `while` is executed first and your `main` last, if the condition holds. – Hyperboreus Dec 10 '13 at 17:58
  • Ah yes, because it needs to check whether the user input is valid, e.g. whether it is in the correct format or empty, so it loops until it gets a valid URL. However, it accepts any URL like http://www.domain.com/ because that is a correct format. I want to add another test to check the connectivity of the URL. – user3034404 Dec 10 '13 at 18:38

2 Answers


Maybe this snippet helps you understand why your `main` is executed after the `while` loop:

print 'Checkpoint Alpha'

while True:
    print 'Checkpoint Bravo'
    if raw_input ('x for break: ') == 'x': break

print 'Checkpoint Charlie'

def main():
    print 'Checkpoint Foxtrot'

print 'Checkpoint Delta'

if __name__ == '__main__':
    print 'Checkpoint Echo'
    main()
    print 'Checkpoint Golf'

print 'Checkpoint Hotel'
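
Applied to the script in the question, the same ordering rule suggests moving the prompt loop into a function, so that nothing runs before `main` is called. Below is a minimal sketch of that idea, written in Python 3 syntax (the snippets above are Python 2). The names `prompt_until_valid`, `is_valid` and `read_line` are hypothetical helpers introduced for illustration and do not appear in the original code. The demo at the bottom feeds canned answers instead of reading the keyboard.

```python
def prompt_until_valid(is_valid, read_line=input):
    """Keep asking for an address until is_valid(url) returns True."""
    while True:
        url = read_line('Please input address: ').strip()
        if url and is_valid(url):
            return url
        print('No input or wrong URL, please try again')


def main(is_valid, read_line=input):
    # The loop now runs only when main() is called, not while the
    # module is being executed top to bottom.
    url = prompt_until_valid(is_valid, read_line)
    print('[*] Using', url)
    return url


if __name__ == '__main__':
    # Demo with canned answers (assumption: stands in for real user input).
    answers = iter(['', 'not-a-url', 'http://www.example.com/'])
    # Placeholder format check; a connectivity test could be passed instead.
    main(lambda u: u.startswith(('http://', 'https://')),
         lambda prompt: next(answers))
```

Passing the validity check in as a parameter keeps the loop independent of how a URL is judged, so a stricter test can be swapped in later without touching the prompt logic.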
Hyperboreus
  • @KDawG You can take the officer out of the Air Force, but you can't take the Air Force out of the officer. Tally Ho! – Hyperboreus Dec 10 '13 at 22:54

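The following should help. Keep a list of the URLs already tried and skip any repeats inside the loop:

visited = []

while True:
    try:
        url = raw_input('Please input address: ')
        if url in visited:
            print "Already visited. Continue"
            continue  # ask again instead of fetching the same URL twice
        visited.append(url)
        webpage = urllib.urlopen(url)
        [...]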
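
The connectivity test asked about in the question could take the place of the bare `urlopen` call in that loop. The sketch below is one possible way to do it, assuming Python 3 (`urllib.request` and `urllib.error` rather than the `urllib2` module used in the question). The function name `check_url` and its `opener` and `timeout` parameters are hypothetical, chosen for illustration. The messages mirror those in the question's `urllib2` snippet.

```python
import urllib.error
import urllib.request


def check_url(url, opener=urllib.request.urlopen, timeout=10):
    """Return (ok, message) describing whether url could be fetched."""
    try:
        response = opener(url, timeout=timeout)
    except urllib.error.HTTPError as e:
        # HTTPError is a subclass of URLError, so it has to be caught first.
        return False, 'The server could not fulfill the request. Error code: %s' % e.code
    except urllib.error.URLError as e:
        return False, 'We failed to reach a server. Reason: %s' % e.reason
    except ValueError as e:
        # Raised for malformed input such as an empty string or a missing scheme.
        return False, 'Wrong URL format: %s' % e
    response.close()
    return True, 'URL is good!'
```

Inside the loop, something like `ok, message = check_url(url)` followed by `if ok: break` would then end the prompt only once the site has actually responded. The `opener` parameter exists only so the function can be exercised without network access.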
Arovit