Simply put, there is no simple solution. Having few dependencies leads to unsightly code, and vice versa: code robustness requires more dependencies.
With this in mind, I describe a few solutions below and sum them up with the pros and cons of each.
Approach 1
You can use wget's -k option together with some regular expressions (read more about parsing HTML that way).
From the Linux manual:
-k
--convert-links
After the download is complete, convert the links in the document to
make them suitable for local viewing.
(...)
The links to files that have not been downloaded by Wget will be
changed to include host name and absolute path of the location they
point to.
Example: if the downloaded file /foo/doc.html links to /bar/img.gif
(or to ../bar/img.gif), then the link in doc.html will be modified to
point to http://hostname/bar/img.gif.
An example script:
# wget needs a file in order for -k to work
tmpfil=$(mktemp)
# -k - convert links
# -q - suppress output
# -O - save the document to the given file
wget http://example.com -k -q -O "$tmpfil"
# -o - print only matching parts
# you could use any other popular regex here
grep -o "http://[^'\"<>]*" "$tmpfil"
# remove the temporary file
rm "$tmpfil"
Pros:
- Works out of the box on most systems, assuming you have wget installed.
- In most cases, this will be a sufficient solution.
Cons:
- Relies on regular expressions, which are bound to break on some exotic pages, because HTML's nested structure cannot in general be described by regular expressions (regular languages sit below context-free languages in the Chomsky hierarchy).
- You cannot pass a location on your local file system; you have to pass a working URL.
Approach 2
You can use Python together with BeautifulSoup (the bs4 package). An example script:
#!/usr/bin/env python3
import sys
import urllib.request
from urllib.parse import urljoin

from bs4 import BeautifulSoup

if len(sys.argv) <= 1:
    print('Missing URL argument', file=sys.stderr)
    sys.exit(1)

content = urllib.request.urlopen(sys.argv[1]).read()
soup = BeautifulSoup(content, 'html.parser')

# resolve relative links against the URL that was passed in
for anchor in soup.find_all('a', href=True):
    print(urljoin(sys.argv[1], anchor.get('href')))
And then:
dummy:~$ ./test.py http://example.com
Pros:
- It is the correct way to handle HTML, since it properly uses a fully fledged parser.
- Exotic HTML is very likely to be handled well.
- With small modifications, this approach works for local files, not only URLs.
- With small modifications, you can even supply your own base URL (both tweaks are covered in the sketch after the cons below).
Cons:
- It needs Python.
- It needs Python with a third-party package (BeautifulSoup).
- You need to manually handle other tags and attributes like <img src>, <link href>, <script src>, etc., which isn't covered by the script above (a sketch follows this list).
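For reference, here is a rough sketch of how those modifications might look; it is not part of the original script, and the script name and argument layout are my own assumptions. It accepts either a URL or a local file path, takes an optional base URL as the second argument, and also walks <img src>, <link href> and <script src>:
#!/usr/bin/env python3
import os
import sys
import urllib.request
from urllib.parse import urljoin

from bs4 import BeautifulSoup

if len(sys.argv) <= 1:
    print('Usage: links.py URL-or-file [base-url]', file=sys.stderr)
    sys.exit(1)

source = sys.argv[1]
# an optional second argument overrides the base used for relative links;
# for local files you will usually want to pass one explicitly
base = sys.argv[2] if len(sys.argv) > 2 else source

if os.path.exists(source):
    with open(source, 'rb') as handle:
        content = handle.read()
else:
    content = urllib.request.urlopen(source).read()

soup = BeautifulSoup(content, 'html.parser')

# (tag name, attribute that holds the link)
targets = [('a', 'href'), ('img', 'src'), ('link', 'href'), ('script', 'src')]
for name, attribute in targets:
    for tag in soup.find_all(name, **{attribute: True}):
        print(urljoin(base, tag[attribute]))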
Approach 3
You can use some features of lynx. (This one was mentioned in the answer you provided in your question.) Example:
lynx http://example.com/ -dump -listonly -nonumbers
Pros:
- Very concise usage.
- Works well with all kinds of HTML.
Cons:
- You need Lynx.
- Although you can extract links from files as well, you cannot control the base URL, and you end up with file://localhost/ links. You can work around this with ugly hacks like manually inserting a <base href=""> tag into the HTML (a rough sketch of such a hack follows).
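For example, something along these lines injects a base tag and feeds the result to lynx on standard input. This is only a sketch: the file name and base URL are placeholders, and it assumes the page contains a literal lowercase <head> tag:
# inject a <base href> right after <head>, then let lynx resolve links against it
sed 's|<head>|<head><base href="http://example.com/">|' page.html \
    | lynx -stdin -dump -listonly -nonumbers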