How to print only matches with sed?

Question

Okay, this is an easy one, but I can't figure it out.

Basically I want to extract all links (<a href="[^<>]*">[^<>]*</a>) from a big html file.

I tried to do this with sed, but I get all kinds of results, just not what I want. I know that my regexp is correct, because I can replace all the links in a file:

sed 's_<a href="[^<>]*">[^<>]*</a>_TEST_g'

If I run that on something like

<div><a href="http://wwww.google.com">A google link</a></div>
<div><a href="http://wwww.google.com">A google link</a></div>

I get

<div>TEST</div>
<div>TEST</div>

How can I get rid of everything else and just print the matches instead? My preferred end result would be:

<a href="http://wwww.google.com">A google link</a>
<a href="http://wwww.google.com">A google link</a>

PS. I know that my regexp is not the most flexible one, but it's enough for my intentions.

Thanks, that works too. I'd still like to know if it's possible with sed. — DrummerB, Aug 25 '12 at 23:38

score 4 · Accepted Answer · edited May 23 '17 at 12:23

Match the whole line, put the interesting part in a group, replace by the content of the group. Use the -n option to suppress non-matching lines, and add the p modifier to print the result of the s command.

sed -n -e 's!^.*\(<[Aa] [^<>]*>.*</[Aa]>\).*$!\1!p'
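As a quick check of the single-link case, the command can be run against the sample from the question. This is only a sketch: the function name `last_link` and the extra line without a link are illustrative assumptions, not part of the original answer.

```shell
# Hypothetical helper wrapping the command above:
# prints the last link found on each input line, and nothing for other lines.
last_link() {
  sed -n -e 's!^.*\(<[Aa] [^<>]*>.*</[Aa]>\).*$!\1!p'
}

# Sample line from the question, plus a line that contains no link.
printf '%s\n' \
  '<div><a href="http://wwww.google.com">A google link</a></div>' \
  '<p>no link here</p>' |
  last_link
```

Only the first input line produces output, because -n suppresses the automatic print and the p flag fires only when the substitution succeeds.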

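Note that if there are multiple links on a line, this only prints the last one, because the leading .* is greedy. You can improve on that, but it goes beyond simple sed usage. The simplest method is to use two steps: first insert a newline after every link, so that each link ends up on its own line, then extract the links. The g flag is needed so that every </a> on the line is split, not just the first; \n in the replacement is understood as a newline by GNU sed.

sed -n -e 's!</[Aa]>!&\n!gp' | sed -n -e 's!^.*\(<[Aa] [^<>]*>.*</[Aa]>\).*$!\1!p'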
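The two-step idea can be sketched as a small shell function. This version makes two assumptions of its own: it splits after every `</a>` on the line (the `g` flag), and it writes the newline in the replacement as backslash-newline, which should also work in sed implementations that do not interpret `\n` there. The name `all_links` and the sample input are hypothetical.

```shell
# Hypothetical helper: prints every link on its own line.
# Step 1 breaks the line after each closing </a>; the backslash followed by
#        a real newline is a literal newline in the replacement text.
# Step 2 keeps only the link on each resulting line and drops lines without one.
all_links() {
  sed -e 's!</[Aa]>!&\
!g' |
    sed -n -e 's!^.*\(<[Aa] [^<>]*>.*</[Aa]>\).*$!\1!p'
}

# Three links on a single line of input.
printf '%s\n' '<p><a href="/one">1</a> and <a href="/two">2</a>, <a href="/three">3</a></p>' |
  all_links
```

For this input the function prints the three links in order, one per line. The caveats about comments, <pre> and links spanning several lines still apply.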

This still doesn't handle HTML comments, <pre>, links that are spread over several lines, etc. When parsing HTML, use an HTML parser.

score 2 · Answer 2 · answered Aug 25 '12 at 23:56

If you don't mind using perl in place of sed, it can cope with very diverse input:

  perl -n -e 's+(<a href=.*?</a>)+ print $1, "\n" +eg;'

answered Aug 25 '12 at 23:56

Gilbert

score 1 · Answer 3 · answered Aug 25 '12 at 23:42

Assuming that there is only one hyperlink per line the following may work...

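  sed -e 's_.*<a href=_<a href=_' -e 's_>.*_>_'

Note that the second expression cuts the line at the first >, so only the opening tag is printed; replacing it with -e 's_</a>.*_</a>_' keeps the link text and the closing tag as well.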

answered Aug 25 '12 at 23:42

Gilbert

Unfortunately that's not the case :( – DrummerB Aug 25 '12 at 23:43

score 0 · Answer 4 · answered Aug 26 '12 at 07:24

This might work for you (GNU sed):

sed '/<a href\>/!d;s//\n&/;s/[^\n]*\n//;:a;$!{/>/!{N;ba}};y/\n/ /;s//&\n/;P;D' file

answered Aug 26 '12 at 07:24

potong
