11

Greetz folks.

I'm looking for a way to do the same thing as PHP's preg_replace() (search for text matching a regular expression and replace it) in a shell script.

So, consider the following file.

<a href="http://example.com/">Website #1</a>
<a href="http://example.net/">Website #2</a>
<a href="http://example.org/">Website #3</a>

And I want to get this:

http://example.com/
http://example.net/
http://example.org/

Is there a way to do this? Thanks.

codaddict
seriousdev
  • Your text differs from your example. Do you want to extract part(s) of your strings (as in your examples) or do you want to actually replace it with something? – plundra Dec 20 '10 at 15:18
  • Also, [don't parse HTML with regular expressions](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) (in general). –  Dec 20 '10 at 15:19
  • If you say, "So, consider the following file.", then people are gonna assume it's the data. Make a proper question next time. – Anders Dec 20 '10 at 15:20
  • @plundra Yes, sorry. Actually you're right, "extract" is the correct word. @delnan Well I just want to extract some strings... @Anders You're absolutely right. – seriousdev Dec 20 '10 at 15:22

2 Answers

10

You can use sed like this:

sed -r 's/.*href="([^"]*)".*/\1/' file

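A self-contained sketch of the same substitution, in case it helps to try it without a file. The `input` and `urls` variables and the extra non-link line are assumptions added for illustration. It uses `-E` (accepted by both GNU and BSD sed) instead of the GNU-only `-r`, and adds `-n` with the `p` flag so that lines without an `href` are dropped instead of being printed unchanged.

```shell
# Hypothetical sample input: the three lines from the question,
# plus one line without a link to show the effect of -n ... p.
input='<a href="http://example.com/">Website #1</a>
<p>no link here</p>
<a href="http://example.net/">Website #2</a>
<a href="http://example.org/">Website #3</a>'

# s/REGEX/REPLACEMENT/ substitutes; ([^"]*) captures everything up to
# the closing quote, and \1 in the replacement is that captured text.
# -n turns off automatic printing; the trailing p prints only the
# lines where the substitution actually matched.
urls=$(printf '%s\n' "$input" | sed -nE 's/.*href="([^"]*)".*/\1/p')

printf '%s\n' "$urls"
```

Because the leading `.*` is greedy, this keeps only the last `href` on a line, which is fine for input with one link per line like the example above.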

codaddict
  • Great, thanks! So I assume `s` is to tell sed to use a regex but what is `\1/` for? – seriousdev Dec 20 '10 at 15:26
  • No, `s` is substitute and `\1` is the first match (group? unsure about the term), 1 being the content of the first parenthesis. `[^"]*` in the case above. – plundra Dec 20 '10 at 15:32
0

While sed is perfectly suitable, it doesn't allow more than 9 backreferences. Perl does:

echo "a b c d e f g h i j k l m n o p q r s t u v w x y z" | \
    perl -lpe 's/(\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+)/$1;$2;$3;$4;$5;$6;$7;$8;$9;$10;$11;$12;$13;$14;$15;$16;$17;$18;$19;$20;$21;$22;$23;$24;$25;$26/g'
a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z

This (dumb) example shows that it's possible to go beyond sed's \9.

Benoit Duffez