
I want to get all URLs from a specific page in Bash.

This problem is already solved here: Easiest way to extract the urls from an html page using sed or awk only

The trick, however, is to resolve relative links into absolute ones. So if http://example.com/ contains links like:

<a href="/about.html">About us</a>
<script type="text/javascript" src="media/blah.js"></script>

I want the results to have the following form:

http://example.com/about.html
http://example.com/media/blah.js

How can I do so with as few dependencies as possible?

rr-
3 Answers


Simply put, there is no simple solution. Keeping dependencies to a minimum leads to unsightly code, and vice versa: code robustness comes with higher dependency requirements.

With this in mind, I describe a few solutions below and sum them up by listing the pros and cons of each one.

Approach 1

You can use wget's -k option together with some regular expressions (read more about the pitfalls of parsing HTML that way).

From the wget manual:

-k
--convert-links
    After the download is complete, convert the links in the document to 
    make them suitable for local viewing.  
    (...)
    The links to files that have not been downloaded by Wget will be 
    changed to include host name and absolute path of the location they 
    point to.
    Example: if the downloaded file /foo/doc.html links to /bar/img.gif
    (or to ../bar/img.gif), then the link in doc.html will be modified to
    point to http://hostname/bar/img.gif.

An example script:

#wget needs a file in order for -k to work
tmpfil=$(mktemp);

#-k - convert links
#-q - suppress output
#-O - redirect output to given file
wget http://example.com -k -q -O "$tmpfil";

#-o - print only matching parts
#you could use any other popular regex here
grep -o "http://[^'\"<>]*" "$tmpfil"

#remove unnecessary file
rm "$tmpfil"

Pros:

  1. Works out of the box on most systems, assuming you have wget installed.
  2. In most cases, this will be a sufficient solution.

Cons:

  1. Relies on regular expressions, which are bound to break on some exotic pages, because HTML's nested structure needs more expressive power than regular expressions offer in the Chomsky hierarchy.
  2. You cannot pass a location in your local file system; you must pass a working URL.
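
If you plan to reuse this, the same steps can be wrapped in a small shell function. This is a sketch only (the extract_urls name is mine); it also switches to an extended regex so that https links are caught as well:

#a sketch only: the extract_urls name is not part of the original answer
extract_urls() {
    local url=$1
    local tmpfil
    tmpfil=$(mktemp)

    #-k - convert links, -q - suppress output, -O - write the page to the temp file
    wget "$url" -k -q -O "$tmpfil"

    #-o - print only matching parts, -E - extended regex (catches http and https)
    grep -oE "https?://[^'\"<>]*" "$tmpfil"

    rm "$tmpfil"
}

extract_urls http://example.com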

Approach 2

You can use Python together with BeautifulSoup. An example script:

#!/usr/bin/python
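#note: this is a Python 2 script using the old BeautifulSoup 3 package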
import sys
import urllib
import urlparse
import BeautifulSoup

if len(sys.argv) <= 1:
    print >>sys.stderr, 'Missing URL argument'
    sys.exit(1)

content = urllib.urlopen(sys.argv[1]).read()
soup = BeautifulSoup.BeautifulSoup(content)
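#urljoin resolves each (possibly relative) href against the page URL given as argument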
for anchor in soup.findAll('a', href=True):
    print urlparse.urljoin(sys.argv[1], anchor.get('href'))

And then:

dummy:~$ ./test.py http://example.com

Pros:

  1. It's the correct way to handle HTML, since it properly uses a fully-fledged parser.
  2. Exotic HTML is very likely to be handled well.
  3. With small modifications, this approach works for files, not URLs only.
  4. With small modifications, you might even be able to give your own base URL.

Cons:

  1. It needs Python.
  2. It needs Python with a custom package (BeautifulSoup).
  3. You need to manually handle tags and attributes like <img src>, <link href>, <script src>, etc. (which isn't handled in the script above).

Approach 3

You can use some features of lynx. (This one was mentioned in the answer you provided in your question.) Example:

lynx http://example.com/ -dump -listonly -nonumbers

Pros:

  1. Very concise usage.
  2. Works well with all kinds of HTML.

Cons:

  1. You need Lynx.
  2. Although you can extract links from files as well, you cannot control the base URL, and you end up with file://localhost/ links. You can fix this with ugly hacks like manually inserting a <base href=""> tag into the HTML (see the sketch below).
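
A rough illustration of that hack, assuming a local page.html whose <head> tag is written in lowercase (the file name and base URL are placeholders):

#inject a <base> tag right after <head> so that relative links resolve
#against the original site instead of file://localhost/
tmpfil=$(mktemp)
sed 's|<head>|<head><base href="http://example.com/">|' page.html > "$tmpfil"

#-force_html makes lynx treat the extension-less temporary file as HTML
lynx -force_html "$tmpfil" -dump -listonly -nonumbers

rm "$tmpfil"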
rr-

Another option is my Xidel (XQuery/Webscraper):

For all normal links:

xidel http://example.com/ -e '//a/resolve-uri(@href)'

For all links and srcs:

xidel http://example.com/ -e '(//@href, //@src)/resolve-uri(.)'

With rr-'s format:

Pros:

  1. Very concise usage.

  2. Works well with all kinds of HTML.

  3. It's the correct way to handle HTML, since it properly uses a fully-fledged parser.

  4. Works for files and URLs.

  5. You can give your own base URL (with resolve-uri(@href, "baseurl")); see the sketch after the cons below.

  6. No dependencies except Xidel (and OpenSSL, if you also have https URLs).

Cons:

  1. You need Xidel, which is not contained in any standard repository.
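
For example, pros 4 and 5 combined would presumably look something like this (local.html and the base URL are placeholders):

xidel local.html -e '//a/resolve-uri(@href, "http://example.com/")'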
BeniBela

Why not simply this?

re='(src|href)='
baseurl='example.com'
#split fields on double quotes: on a matching line, $2 is the first quoted
#attribute value, so prepend the base URL to it
wget -O- "http://$baseurl" | awk -F'"' -v re="$re" -v baseurl="$baseurl" '$0 ~ re {print baseurl $2}'

You just need wget and awk.

Feel free to improve the snippet a bit if you have both relative & absolute URLs at the same time.

Gilles Quénot