
I'm a Python newbie. What I'm trying to do is create two scripts: one that downloads the webpage information, and another that downloads the links and outputs a summary of the total number of links downloaded into a list.

First script (Download webpage)

import sys, urllib
def getWebpage(url):
    print '[*] getWebpage()'
    url_file = urllib.urlopen(url)
    page = url_file.read()
    return page
def main():
    sys.argv.append('http://www.funeralformyfat.tumblr.com')
    if len(sys.argv) != 2:
        print '[-] Usage: webpage_get URL'
        return
    else:
        print getWebpage(sys.argv[1])

if __name__ == '__main__':
    main()

Second script (downloads the links and outputs a summary of the total number of links downloaded into a list):

import sys, urllib, re
import webpage_get
def print_links(page):
    print '[*] print_links()'
    links = re.findall(r'\<a.*href\=.*http\:.+', page)
    links.sort()
    print '[+]', str(len(links)), 'HyperLinks Found:'

    for link in links:
        print link

def main():
    sys.argv.append('http://www.funeralformyfat.tumblr.com')
    if len(sys.argv) != 2:
        print '[-] Usage: webpage_links URL'
        return
        page = webpage_get.getWebpage(sys.argv[1])
        print_links(page)


if __name__ == '__main__':
    main()

My code runs, however it doesn't return any links. Can anyone see the issue?

NatD
    Welcome to SO! Can you not make a smaller program that fails for the same reason? Find a minimal example, and it will be much easier for yourself and others to answer the question. – Ciro Santilli OurBigBook.com Nov 22 '13 at 13:59
  • Your indentation is wrong from time to time ... I assume you want the print *inside* of the main function in the first script and the for loop *inside* of print_links. Also, for such a task you should consider to not use regex http://stackoverflow.com/questions/1732348 :-) – mwil.me Nov 22 '13 at 13:59
  • @cirosantilli The example here did not have much of a code & is fine. Sometimes full codes have to be posted in order to identify the errors. PS - There was not much of a shortening that could be done for this one. – shad0w_wa1k3r Nov 22 '13 at 14:12

1 Answer

print getWebpage(sys.argv[1]) <---- IndexError: list index out of range

sys.argv has not yet been updated at that point (the append is inside the main function, which has not been called yet).
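A minimal sketch of that ordering problem, using a plain list standing in for `sys.argv` (the script name and URL are placeholders):

```python
# Simulate sys.argv with a plain list to show the ordering problem.
argv = ['webpage_get.py']  # at import time, only the script name is present

def main():
    # the append only happens once main() is actually called
    argv.append('http://example.com')

# Reading argv[1] before main() runs raises IndexError
try:
    url = argv[1]
except IndexError:
    url = None
print(url)       # None

main()
print(argv[1])   # http://example.com
```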

Try the following and see for yourself that it works:

import sys, urllib
def getWebpage(url):
    print '[*] getWebpage()'
    url_file = urllib.urlopen(url)
    page = url_file.read()
    return page
def main():
    sys.argv.append('http://www.funeralformyfat.tumblr.com')
    if len(sys.argv) != 2:
        print '[-] Usage: webpage_get URL'
        return
    else:
        print getWebpage(sys.argv[1])

if __name__ == '__main__':
    main()


NameError: name 'links' is not defined

This one occurs because the code that uses the links list is indented outside of the function that defines that variable. The correct indentation would be:

import re

def print_links(page):
    print '[*] print_links()'
    links = re.findall(r'\<a.*href\=.*http\:.+', page)
    links.sort()
    print '[+]', str(len(links)), 'HyperLinks Found:'

    for link in links:
        print link
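As a side note, echoing the comments: regex is a fragile way to pull links out of HTML. A minimal sketch using the standard-library HTMLParser instead (the sample page string is made up for illustration; written so it runs under both Python 2 and 3):

```python
try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3

class LinkParser(HTMLParser):
    """Collects href attributes from <a> tags."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

# A made-up page snippet for illustration
page = '<a href="http://example.com">one</a> <a href="http://example.org">two</a>'
parser = LinkParser()
parser.feed(page)
print('[+] %d HyperLinks Found:' % len(parser.links))
for link in parser.links:
    print(link)
```

Unlike the regex approach, the parser handles attribute ordering, extra whitespace, and quoting styles without any changes to the pattern.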
shad0w_wa1k3r
  • If you look closer at the code provided you will see that the command line parameter is appended in the main() function. – mwil.me Nov 22 '13 at 14:05
  • @AshishNitinPatil Hi, thanks for the reply. I made changes but now I get an error on this line; sys.argv.append('http://www.funeralformyfat.tumblr.com') NameError: global name 'sys' is not defined, do you know why this is? – NatD Nov 22 '13 at 15:15
  • Did you `import sys` before the execution of that line? – shad0w_wa1k3r Nov 22 '13 at 17:52
  • @AshishNitinPatil Hi, yeah. Sorry, I was being an idiot. I've updated it but now when I run it I don't get any links no matter what url I enter. I've updated my question with my new code. Could you tell me what's wrong? – NatD Nov 22 '13 at 20:43
  • You need to verify your regex. – shad0w_wa1k3r Nov 23 '13 at 14:03
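To illustrate that last comment: the question's pattern matches greedily to the end of each line instead of capturing just the URL. A hedged sketch of a tighter pattern (the page snippet is made up; this is still fragile compared to a real HTML parser):

```python
import re

# A made-up page snippet; both anchors should be captured.
page = ('<a href="http://example.com">one</a>\n'
        '<a href="http://example.org/x">two</a>')

# The question's pattern grabs everything to the end of the line:
greedy = re.findall(r'\<a.*href\=.*http\:.+', page)
# A tighter pattern captures only the href URL:
urls = re.findall(r'<a[^>]*href="(http[^"]*)"', page)
print(urls)  # ['http://example.com', 'http://example.org/x']
```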