67

I want to extract the URL from within the anchor tags of an html file. This needs to be done in BASH using SED/AWK. No perl please.

What is the easiest way to do this?

casperOne
codaddict
  • Read this and be enlightened: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Dennis Williamson Dec 10 '09 at 14:44
  • If you don't mind that *there is no guarantee that you find all urls* **or** *there is no guarantee that all urls you find are valid*, use one of the examples below. If you do mind, use an appropriate tool for the job (Perl, Python, Ruby). – Nifle Dec 10 '09 at 14:59
  • My previous comment is of course for any *easy* solution you might try. awk is powerful enough to do the job, heck you could theoretically implement perl in awk... – Nifle Dec 10 '09 at 15:02
  • Is this like one of those survivor challenges, where you have to live for three days eating only termites? If not, seriously, why the restriction? Every modern system can install at least Perl, and from there, you have the whole web. – Randal Schwartz Dec 21 '09 at 02:33

16 Answers

65

You could also do something like this (provided you have lynx installed)...

Lynx versions < 2.8.8

lynx -dump -listonly my.html

Lynx versions >= 2.8.8 (courtesy of @condit)

lynx -dump -hiddenlinks=listonly my.html
fatuhoku
Hardy
  • In Lynx 2.8.8 this has become `lynx -dump -hiddenlinks=listonly my.html` – condit May 07 '14 at 22:17
  • Better: `lynx -dump -listonly -hiddenlinks=listonly my.html`; if you don't also keep the bare `-listonly`, you get body text, not just links. – Charles Duffy Sep 25 '22 at 17:33
45

You asked for it:

$ wget -O - http://stackoverflow.com | \
  grep -io '<a href=['"'"'"][^"'"'"']*['"'"'"]' | \
  sed -e 's/^<a href=["'"'"']//i' -e 's/["'"'"']$//i'

This is a crude tool, so all the usual warnings about attempting to parse HTML with regular expressions apply.

Hayden Schiff
Greg Bacon
18
grep "<a href=" sourcepage.html
  |sed "s/<a href/\\n<a href/g" 
  |sed 's/\"/\"><\/a>\n/2'
  |grep href
  |sort |uniq
  1. The first grep looks for lines containing urls. You can add more elements after it if you only want to look at local pages, i.e. no http, just relative paths.
  2. The first sed adds a newline in front of each `<a href` url tag.
  3. The second sed shortens each url after the 2nd `"` in the line by replacing it with a closing `</a>` tag followed by a newline. Both seds put each url on its own line, but there is garbage left over, so
  4. the 2nd grep on href cleans the mess up, and
  5. the sort and uniq give you one instance of each url present in sourcepage.html (a quick worked example follows below).
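
As a quick worked example on a hypothetical one-line page (the URLs are made up; GNU grep/sed assumed), note that each url comes out wrapped in an empty anchor rather than bare:

$ printf '%s\n' '<p><a href="http://a.example/x">one</a> and <a href="http://b.example/y" class="ext">two</a></p>' > sourcepage.html
$ grep "<a href=" sourcepage.html | sed "s/<a href/\\n<a href/g" | sed 's/\"/\"><\/a>\n/2' | grep href | sort | uniq
<a href="http://a.example/x"></a>
<a href="http://b.example/y"></a>
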
kerkael
16

With Xidel, the HTML/XML data extraction tool, this can be done via:

$ xidel --extract "//a/@href" http://example.com/

With conversion to absolute URLs:

$ xidel --extract "//a/resolve-uri(@href, base-uri())" http://example.com/
Ingo Karkat
  • concat expects 2 arguments but here only one (base url is given). err:XPST0017: unknown function: concat #1 Did you mean: In module http://www.w3.org/2005/xpath-functions: concat #2-65535 – smihael Aug 24 '17 at 08:04
  • @smihael: You're right, that's superfluous here. Removed it. Thanks for noticing! – Ingo Karkat Aug 24 '17 at 08:13
15

I made a few changes to Greg Bacon's solution:

cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'

This fixes two problems:

  1. We are matching cases where the anchor doesn't start with `href` as its first attribute
  2. We are covering the possibility of having several anchors on the same line (a quick check of both cases is shown below)
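
As a rough sanity check, here is a hypothetical one-line input exercising both cases (the domain names are made up); the output shown assumes GNU grep and GNU sed:

$ echo '<p><a href="http://a.example">one</a> and <a class="x" href="http://b.example">two</a></p>' | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'
http://a.example
http://b.example
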
Crisboot
12

An example, since you didn't provide any sample input:

awk 'BEGIN{
  RS="</a>"       # treat each closing anchor tag as the record separator (GNU awk)
  IGNORECASE=1    # match href/HREF alike (GNU awk)
}
{
  for(o=1;o<=NF;o++){
    if ( $o ~ /href/){
      gsub(/.*href=\042/,"",$o)   # drop everything up to and including href=" (\042 is a double quote)
      gsub(/\042.*/,"",$o)        # drop everything from the closing quote onward
      print $(o)
    }
  }
}' index.html
ghostdog74
5

You can do it quite easily with the following regex, which is quite good at finding URLs:

\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))

I took it from John Gruber's article on how to find URLs in text.

That lets you find all URLs in a file f.html as follows:

cat f.html | grep -o \
    -E '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'
nes1983
  • complicated, and fails when href is like this: ... HREF="http://www.somewhere.com/" ADD_DATE="1197958879" LAST_MODIFIED="1249591429"> ... – ghostdog74 Dec 10 '09 at 14:35
  • I tried it on the daringfireball page itself and it found all links. Other solutions may fail because href= could be somewhere inside regular text. It's difficult to get this absolutely right without parsing the HTML according to its grammar. – nes1983 Dec 10 '09 at 14:45
  • You don't need to have a cat before the grep. Just put f.html at the end of the grep. – monksy Apr 13 '12 at 05:10
  • And grep -o can fail due to a bug in some versions of grep. – kisp Aug 23 '13 at 21:45
5

I am assuming you want to extract a URL from some HTML text, and not parse HTML (as one of the comments suggests). Believe it or not, someone has already done this: the sed grabbag has a ready-made `list_urls.sed` script (see the comment below for a download link).

OT: The sed website has a lot of good information and many interesting/crazy sed scripts. You can even play Sokoban in sed!

Alok Singhal
  • This is the easiest and simplest answer. Just do e.g. `wget http://sed.sourceforge.net/grabbag/scripts/list_urls.sed -O ~/bin/list_urls.sed && chmod +x ~/bin/list_urls.sed` to get the script, and then `wget http://www.example.com -O - | ~/bin/list_urls.sed > example.com.urls.txt` to get the urls in a text file! – arjan Feb 18 '16 at 22:56
4

In bash, the following should work. Note that it doesn't use sed or awk, but uses tr and grep, both very standard and not perl ;-)

$ cat source_file.html | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e'^//' | sort | uniq

for example:

$ curl "https://www.cnn.com" | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e'^//' | sort | uniq

generates

//s3.amazonaws.com/cnn-sponsored-content
//twitter.com/cnn
https://us.cnn.com
https://www.cnn.com
https://www.cnn.com/2018/10/27/us/new-york-hudson-river-bodies-identified/index.html\
https://www.cnn.com/2018/11/01/tech/google-employee-walkout-andy-rubin/index.html\
https://www.cnn.com/election/2016/results/exit-polls\
https://www.cnn.com/profiles/frederik-pleitgen\
https://www.facebook.com/cnn
etc...
Brad Parks
3

This is my first post, so I'll do my best to explain why I'm posting this answer...

  1. Of the first 7 most-voted answers, 4 include grep even when the post explicitly says "using sed or awk only".
  2. Even though the post requires "No perl please", the answers from the previous point use Perl-compatible regex inside grep.
  3. And because this is the simplest way (as far as I know, and as was requested) to do it in Bash.

So here comes the simplest script, using GNU grep 2.28:

grep -Po 'href="\K.*?(?=")'

About the `\K` switch: no info was found in the man and info pages, so I came here for the answer. The `\K` switch discards the previously matched characters (the key itself) from the reported match. Bear in mind the advice from the man pages: "This is highly experimental and grep -P may warn of unimplemented features."
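
A minimal illustration of what `\K` does (assuming GNU grep built with PCRE support; the input line is made up):

$ echo '<a href="http://example.com/page">link</a>' | grep -Po 'href="\K[^"]*'
http://example.com/page
$ echo '<a href="http://example.com/page">link</a>' | grep -Po 'href="[^"]*'
href="http://example.com/page

The second command, without `\K`, keeps the `href="` prefix in the reported match.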

Of course, you can modify the script to meet your tastes or needs, but I found it pretty straightforward for what was requested in the post, and also for many of us...

I hope you folks find it very useful.

thanks!!!

X00D45
2

Expanding on kerkael's answer:

grep "<a href=" sourcepage.html
  |sed "s/<a href/\\n<a href/g" 
  |sed 's/\"/\"><\/a>\n/2'
  |grep href
  |sort |uniq
# now adding some more
  |grep -v "<a href=\"#"
  |grep -v "<a href=\"../"
  |grep -v "<a href=\"http"

The first grep I added removes links to local bookmarks.

The second removes relative links to upper levels.

The third removes links that don't start with http.

Pick and choose which one of these you use as per your specific requirements.

Nikhil VJ
1

Go over with a first pass replacing the start of the urls (http) with a newline (\nhttp). Then you have guaranteed for yourself that your link starts at the beginning of the line and is the only URL on the line.

The rest should be easy, here is an example:

sed "s/http/\nhttp/g" <(curl "http://www.cnn.com") | sed -n "s/\(^http[s]*:[a-Z0-9/.=?_-]*\)\(.*\)/\1/p"

alias lsurls='_(){ sed "s/http/\nhttp/g" "${1}" | sed -n "s/\(^http[s]*:[a-zA-Z0-9/.=?_-]*\)\(.*\)/\1/p"; }; _'

1

Eschewing the awk/sed requirement:

  1. urlextract is made just for such a task (documentation).
  2. urlview is an interactive CLI solution (github repo).
  • [urlextract](https://github.com/lipoja/URLExtract/) worked fantastically; I was only able to extract around 30% of the desired URLs (exactly 100 in total) with lynx and grep. lynx gives the `Bad HTML!` error for the page (or a local HTML file in this case). – user598527 Aug 06 '23 at 15:40
0

You can try:

curl --silent -u "<username>:<password>" "http://<NAGIOS_HOST>/nagios/cgi-bin/status.cgi" |
  grep 'extinfo.cgi?type=1&host=' |
  grep "status" |
  awk -F'</A>' '{print $1}' |
  awk -F"'>" '{print $3"\t"$1}' |
  sed 's/<\/a>&nbsp;<\/td>//g' |
  column -c2 -t |
  awk '{print $1}'
Anthon
dpathak
0

This is how I tried it, for a better view: create a shell file and pass the link as a parameter; it will create a temp2.txt file.

a=$1
lynx -listonly -dump "$a" > temp           # dump only the list of links
awk 'FNR > 2 {print $2}' temp > temp2.txt  # skip the header lines and keep just the URL column
rm temp

Run it as:

sh test.sh http://link.com
Abhishek Gurjar
0

I scrape websites using Bash exclusively to verify the http status of client links and report back to them on errors found. I've found awk and sed to be the fastest and easiest to understand. Props to the OP.

curl -Lk https://example.com/ | sed -r 's~(href="|src=")([^"]+).*~\n\1\2~g' | awk '/^(href|src)/,//'

Because sed works on a single line, this will ensure that all urls are formatted properly on a new line, including any relative urls. The first sed finds all href and src attributes and puts each on a new line while simultaneously removing the rest of the line, including the closing double quote (") at the end of the link.

Notice I'm using a tilde (~) in sed as the defining separator for substitution. This is preferred over a forward slash (/). The forward slash can confuse the sed substitution when working with html.
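
For instance, here is a small made-up example: with `/` as the delimiter every slash in the pattern has to be escaped, while `~` leaves the same substitution readable:

$ echo '<script src="/assets/js/app.js"></script>' | sed 's/\/assets\/js\//\/static\//'
<script src="/static/app.js"></script>
$ echo '<script src="/assets/js/app.js"></script>' | sed 's~/assets/js/~/static/~'
<script src="/static/app.js"></script>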

The awk finds any line that begins with href or src and outputs it.

Once the content is properly formatted, awk or sed can be used to collect any subset of these links. For example, you may not want base64 images, instead you want all the other images. Our new code would look like:

curl -Lk https://example.com/ | sed -r 's~(href="|src=")([^"]+).*~\n\1\2~g' | awk '/^(href|src)/,//' | awk '/^src="[^d]/,//'

Once the subset is extracted, just remove the href=" or src="

sed -r 's~(href="|src=")~~g'

This method is extremely fast and I use these in Bash functions to format the results across thousands of scraped pages for clients that want someone to review their entire site in one scrape.

CodeMilitant