2

Okay, this is an easy one, but I can't figure it out.

Basically I want to extract all links (<a href="[^<>]*">[^<>]*</a>) from a big html file.

I tried to do this with sed, but I get all kinds of results, just not what I want. I know that my regexp is correct, because I can replace all the links in a file:

sed 's_<a href="[^<>]*">[^<>]*</a>_TEST_g'

If I run that on something like

<div><a href="http://wwww.google.com">A google link</a></div>
<div><a href="http://wwww.google.com">A google link</a></div>

I get

<div>TEST</div>
<div>TEST</div>

How can I get rid of everything else and just print the matches instead? My preferred end result would be:

<a href="http://wwww.google.com">A google link</a>
<a href="http://wwww.google.com">A google link</a>

PS. I know that my regexp is not the most flexible one, but it's enough for my intentions.

DrummerB

4 Answers

4

Match the whole line, put the interesting part in a group, replace by the content of the group. Use the -n option to suppress non-matching lines, and add the p modifier to print the result of the s command.

sed -n -e 's!^.*\(<[Aa] [^<>]*>.*</[Aa]>\).*$!\1!p'
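As a quick check, here is a minimal sketch that runs the command above on the sample input from the question. The function name extract_last_link and the extra line without a link are only illustrative, not part of the original question.

```shell
#!/bin/sh
# Hypothetical helper: print the (last) link of every input line that has one.
extract_last_link() {
    # -n suppresses sed's automatic printing; the p flag prints only the
    # lines on which the substitution succeeded.
    sed -n -e 's!^.*\(<[Aa] [^<>]*>.*</[Aa]>\).*$!\1!p'
}

# Sample input from the question, plus one made-up line with no link.
printf '%s\n' \
    '<div><a href="http://wwww.google.com">A google link</a></div>' \
    '<p>no link on this line</p>' \
    '<div><a href="http://wwww.google.com">A google link</a></div>' |
    extract_last_link
```

The line without a link produces no output at all, which is the effect of combining -n with the p flag.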

Note that if there are multiple links on the line, this only prints the last one, because the leading .* is greedy. You can improve on that, but it goes beyond simple sed usage. The simplest method is to use two steps: first insert a newline after every closing </a>, so that each link ends up on its own line, then extract the links. (Writing \n in the replacement text to get a newline is a GNU sed feature.)

sed -e 's!</[Aa]>!&\n!g' | sed -n -e 's!^.*\(<[Aa] [^<>]*>.*</[Aa]>\).*$!\1!p'

This still doesn't handle HTML comments, <pre>, links that are spread over several lines, etc. When parsing HTML, use an HTML parser.
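The two-step idea described above can be sketched roughly as follows. This assumes GNU sed (for \n in the replacement) and uses the g flag so that every </a> on a line gets its own newline. The function name extract_all_links and the a.example / b.example URLs are made up for illustration, and the same HTML caveats apply.

```shell
#!/bin/sh
# Hypothetical helper: print every link, one per output line.
# Assumes GNU sed, which turns \n in the replacement into a newline.
extract_all_links() {
    # Step 1: break the line after every closing </a> (g = all occurrences).
    # Step 2: on each resulting line, keep only the link itself.
    sed -e 's!</[Aa]>!&\n!g' |
        sed -n -e 's!^.*\(<[Aa] [^<>]*>.*</[Aa]>\).*$!\1!p'
}

# Made-up input line that contains two links.
printf '%s\n' \
    '<p><a href="http://a.example">first</a> and <a href="http://b.example">second</a></p>' |
    extract_all_links
```

After the first step each link sits at the end of its own line, so the greedy .* in the second step no longer swallows earlier links.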

Gilles 'SO- stop being evil'
2

If you don't mind using perl like sed, it can cope with very diverse input:

  perl -n -e 's+(<a href=.*?</a>)+ print $1, "\n" +eg;'
Gilbert
1

Assuming that there is only one hyperlink per line, the following may work. Note that it keeps only the opening <a href=...> tag, because the second expression deletes everything after the first > that remains.

  sed -e 's_.*<a href=_<a href=_' -e 's_>.*_>_'
Gilbert
0

This might work for you (GNU sed):

sed '/<a href\>/!d;s//\n&/;s/[^\n]*\n//;:a;$!{/>/!{N;ba}};y/\n/ /;s//&\n/;P;D' file
potong