
I have two scripts: one downloads a webpage, and the other extracts the links from that page. Both run, but the links script doesn't print any links. Can anyone see why?

Webpage script:

import sys, urllib
def getWebpage(url):
    print '[*] getWebpage()'
    url_file = urllib.urlopen(url)
    page = url_file.read()
    return page
def main():
    sys.argv.append('http://www.bbc.co.uk')
    if len(sys.argv) != 2:
        print '[-] Usage: webpage_get URL'
        return
    else:
        print getWebpage(sys.argv[1])

if __name__ == '__main__':
    main()

Links script:

import sys, urllib, re
import getWebpage
def print_links(page):
    print '[*] print_links()'
    links = re.findall(r'\<a.*href\=.*http\:.+', page)
    links.sort()
    print '[+]', str(len(links)), 'HyperLinks Found:'

    for link in links:
        print link

def main():
    sys.argv.append('http://www.bbc.co.uk')
    if len(sys.argv) != 2:
        print '[-] Usage: webpage_links URL'
        return
    page = webpage_get.getWebpage(sys.argv[1])
    print_links(page)
NatD
  • run it from [the command line](http://learnpythonthehardway.org/book/appendixa.html). You should see either `ImportError` or `NameError` (`getWebpage` vs. `webpage_get`). – jfs Nov 23 '13 at 20:36
  • This does not look right: `import getWebpage` given this usage: `page = webpage_get.getWebpage(sys.argv[1])`. Assuming your file is named `webpage_get.py` and you have a file named `__init__.py` in the same directory, you want this: `from webpage_get import getWebpage; page = getWebpage(sys.argv[1])` – hughdbrown Nov 23 '13 at 20:44
  • Also, you have no call to `main()` in your second script, so likely it does nothing at all. – hughdbrown Nov 23 '13 at 20:47
  • I've fixed it. Thank you. It was the call to main. I assumed it would take it from the first script. – NatD Nov 23 '13 at 20:51
  • Thanks to @J.F.Sebastian and hughdbrown – NatD Nov 23 '13 at 20:55
  • unrelated: you could use [`scrapy` to extract info that you are interested in from a web site](http://doc.scrapy.org/en/latest/intro/tutorial.html). There are [many resources to help you to get started with `scrapy`](https://github.com/scrapy/scrapy/wiki). – jfs Nov 23 '13 at 21:10

2 Answers


This will fix most of your problems:

import sys, urllib, re

def getWebpage(url):
    print '[*] getWebpage()'
    url_file = urllib.urlopen(url)
    page = url_file.read()
    return page

def print_links(page):
    print '[*] print_links()'
    links = re.findall(r'\<a.*href\=.*http\:.+', page)
    links.sort()
    print '[+]', str(len(links)), 'HyperLinks Found:'
    for link in links:
        print link

def main():
    site = 'http://www.bbc.co.uk'
    page = getWebpage(site)
    print_links(page)

if __name__ == '__main__':
    main()

Then you can move on to fixing your regular expression.
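One way to sidestep the regular expression entirely is to let an HTML parser find the `<a>` tags. The following is only a sketch of that alternative, not something from the original scripts: it uses the standard library's `HTMLParser`, and the names `LinkCollector`, `extract_links` and the `sample` markup are made up for illustration. It assumes you only want absolute `http`/`https` links, as the original pattern seems to intend.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects absolute http(s) href values from <a> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value and value.startswith(('http://', 'https://')):
                self.links.append(value)


def extract_links(page):
    """Return the sorted absolute links found in an HTML string."""
    parser = LinkCollector()
    parser.feed(page)
    return sorted(parser.links)


# Made-up markup standing in for a downloaded page.
sample = ('<a href="http://example.com/b">B</a>'
          '<a class="x" href=\'https://example.com/a\'>A</a>'
          '<a href="/rel">R</a>')
print(extract_links(sample))
```

In the script above, `print_links(page)` could call something like `extract_links(page)` in place of `re.findall`. On Python 3 the bytes returned by `read()` would need decoding to text before being fed to the parser.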

While we are on the topic, though, I have two material recommendations:

hughdbrown

Your regular expression doesn't have an end. The `http\:.+` part means "match everything after `http:`", so once the first link is found the match keeps going to the end of that line (`.` does not match a newline by default), and on a page with long lines that can be most of the remaining HTML. You need to specify in the regular expression where a match should stop.
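To make the difference concrete, here is a small sketch comparing the pattern from the question with a bounded one. The `page` string and the second pattern are my own illustration, not taken from the question, and the bounded pattern assumes `href` values are double-quoted, which real pages do not guarantee.

```python
import re

# Made-up markup standing in for a downloaded page (two lines).
page = ('<p><a href="http://example.com/a">A</a> and '
        '<a href="http://example.com/b">B</a></p>\n'
        '<a href="/local">C</a>')

# Pattern from the question: nothing bounds the match, so a single hit
# swallows everything from the first <a to the end of the line.
unbounded = re.findall(r'\<a.*href\=.*http\:.+', page)

# Bounded alternative: stay inside one tag ([^>]*) and capture only the
# double-quoted URL, stopping at the closing quote.
bounded = re.findall(r'<a\b[^>]*\bhref="(https?://[^"]*)"', page)

print(len(unbounded))  # one oversized match covering both links
print(bounded)         # the two absolute URLs only
```

The unbounded pattern reports one "link" that is really the whole first line, while the bounded one returns each URL separately and skips the relative `/local` link.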

djokage