How would I make a simple URL extractor in Python?

Question

How would I start on a single web page, say the root of DMOZ.org, and index every URL linked from it, then store those links in a text file? I don't want the content, just the links themselves. An example would be awesome.

Why do you need this in Python? `wget` can do this without reinventing the wheel. – Daenyth Oct 13 '10 at 15:58

score 2 · Accepted Answer · edited May 23 '17 at 12:18


This, for instance, would print out links on this very related (but poorly named) question:

# Python 2 with BeautifulSoup 3
import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch the page and parse it
q = urllib2.urlopen('https://stackoverflow.com/questions/3884419/')
soup = BeautifulSoup(q.read())

# Walk every <a> tag, printing its text and target
for link in soup.findAll('a'):
    if link.has_key('href'):
        print str(link.string) + " -> " + link['href']
    elif link.has_key('id'):
        # Anchors without an href: fall back to the id
        print "ID: " + link['id']
    else:
        print "???"

Output:

Stack Exchange -> http://stackexchange.com
log in -> /users/login?returnurl=%2fquestions%2f3884419%2f
careers -> http://careers.stackoverflow.com
meta -> http://meta.stackoverflow.com
...
ID: flag-post-3884419
None -> /posts/3884419/revisions
...
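
For readers on Python 3, where `urllib2` and `has_key` no longer exist, the same href/id/neither branching can be sketched with only the standard library. This is an illustration under assumptions, not the answer's own method: it swaps BeautifulSoup for `html.parser`, it does not capture the link text, and the `AnchorLister` class and `SAMPLE` markup are made up for the example.

```python
from html.parser import HTMLParser


class AnchorLister(HTMLParser):
    """Hypothetical helper: records one description string per <a> tag."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs arrives as a list of (name, value) pairs; a dict makes
        # the `in` membership test work on attribute names.
        attributes = dict(attrs)
        if "href" in attributes:
            self.found.append("href: " + str(attributes["href"]))
        elif "id" in attributes:
            self.found.append("ID: " + str(attributes["id"]))
        else:
            self.found.append("???")


# Made-up sample markup, used here instead of fetching a live page.
SAMPLE = '<a href="/users/login">log in</a><a id="flag-post">flag</a><a>bare</a>'

lister = AnchorLister()
lister.feed(SAMPLE)
for line in lister.found:
    print(line)
```

Because the attributes are held in a plain dict here, `"href" in attributes` tests attribute names directly.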

edited May 23 '17 at 12:18 by Community · answered Oct 13 '10 at 17:06 by Nick T

You should use `if 'href' in link:` rather than `link.has_key`. `has_key` is deprecated and was removed in Python 3. – Daenyth Oct 13 '10 at 17:41
For me (Py 2.6.5, BS 3.0.8) `'href' in link` returns `False`, even though `link['href']` will give me a URL. I don't know that much about the workings of dictionaries though. `'href' in zip(*link.attrs)[0]` does seem to work, but is ugly. – Nick T Oct 13 '10 at 18:38

score 0 · Answer 2 · edited May 23 '17 at 11:55


If you insist on reinventing the wheel, use an HTML parser such as BeautifulSoup to pull out all the link tags. This answer to a similar question is relevant.
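
A minimal sketch of that parser-based approach, covering the "store the links in a text file" part of the question. It assumes Python 3 and uses the standard library's `html.parser` in place of BeautifulSoup; the names `HrefCollector`, `extract_links`, `save_links`, and `fetch_html`, and the sample markup, are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class HrefCollector(HTMLParser):
    """Collects the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page they came from.
            absolute = urljoin(self.base_url, href)
            # Linear duplicate check; fine for a single page.
            if absolute not in self.links:
                self.links.append(absolute)


def extract_links(html, base_url):
    """Return the unique absolute link targets found in html, in page order."""
    collector = HrefCollector(base_url)
    collector.feed(html)
    return collector.links


def save_links(links, path):
    """Write one link per line to a text file."""
    with open(path, "w", encoding="utf-8") as handle:
        for link in links:
            handle.write(link + "\n")


def fetch_html(url):
    """Download a page and decode it (not called below, to avoid network access)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Made-up markup standing in for a downloaded page.
SAMPLE_HTML = (
    '<a href="/Arts/">Arts</a> <a href="Business/">Business</a> '
    '<a href="http://example.com/">elsewhere</a> '
    '<a href="/Arts/">Arts again</a> <a name="top">no href</a>'
)

for link in extract_links(SAMPLE_HTML, "http://www.dmoz.org/"):
    print(link)
```

For a live page, something like `save_links(extract_links(fetch_html(url), url), "links.txt")` would chain the three steps; the output filename is an arbitrary choice.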

edited May 23 '17 at 11:55 by Community · answered Oct 13 '10 at 16:42 by Daenyth

score 0 · Answer 3 · answered Oct 14 '10 at 08:47


Scrapy is a Python framework for web crawling. Plenty of examples here: http://snippets.scrapy.org/popular/bookmarked/

answered Oct 14 '10 at 08:47 by ScraperWiki
