I have a sed command that works fine, except when the text it needs to match contains a newline. Here is my command:

sed -i 's,<a href="\(.*\)">\(.*\)</a>,\2 - \1,g'

Now, it works perfectly, but I just ran across this file that has the a tag like so:

<a href="link">Click
        here now</a>

Of course it didn't find this one. So I need to modify it somehow to allow for line breaks in the search. But I have no clue how to do that, short of going over the entire file first and removing every \n beforehand. The problem there is that I lose all formatting in the file.

jfreak53

2 Answers

You can do this by inserting a loop into your sed script:

sed -e '/<a href/{;:next;/<\/a>/!{N;b next;};s,<a href="\(.*\)">\(.*\)</a>,\2 - \1,g;}' yourfile

As-is, that will leave an embedded newline in the output, and it wasn't clear if you wanted it that way or not. If not, just substitute out the newline:

sed -e '/<a href/{;:next;/<\/a>/!{N;b next;};s/\n//g;s,<a href="\(.*\)">\(.*\)</a>,\2 - \1,g;}' yourfile

And maybe clean up extra spaces:

sed -e '/<a href/{;:next;/<\/a>/!{N;b next;};s/\n//g;s/\s\{2,\}/ /g;s,<a href="\(.*\)">\(.*\)</a>,\2 - \1,g;}' yourfile

Explanation: The /<a href/{...} block lets us skip lines we don't care about. Once we find a line that opens a link, we check whether it also contains the end marker. If not (/<\/a>/!), N appends a newline and the next input line to the pattern space, and b next branches back to :next to check again. Once the end marker is in the pattern space, we continue on with the substitutions.
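To see the loop in action, here is a minimal sketch run against a made-up four-line input (the `<p>` lines are only there to show that unrelated lines pass through untouched). The script is the same as the last one-liner, just spread over several lines, and it uses [[:space:]] instead of \s on the assumption that your sed may not support \s.

```shell
# Hypothetical sample input: a link split over two lines, surrounded by other markup.
result=$(printf '%s\n' \
  '<p>intro</p>' \
  '<a href="link">Click' \
  '        here now</a>' \
  '<p>outro</p>' |
sed -e '/<a href/{
:next
/<\/a>/!{
N
b next
}
s/\n//g
s/[[:space:]]\{2,\}/ /g
s,<a href="\(.*\)">\(.*\)</a>,\2 - \1,g
}')

# Show what sed produced.
printf '%s\n' "$result"
```

With this input the two lines of the link are joined and rewritten as `Click here now - link`, while the `<p>` lines are printed unchanged. Note that s/\n//g joins the lines with nothing in between, so a continuation line with no leading whitespace would run straight into the previous word.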

William
  • I keep getting the following error: `sh: 1: Syntax error: Unterminated quoted string` – jfreak53 Apr 05 '13 at 22:53
  • That's coming from your shell. Make sure you have the enclosing single quotes in the right places. (I copied and pasted that last example back into my shell and it worked fine.) BTW, if your version of sed doesn't like the \s (whitespace) escape you can use a literal space, or maybe [[:space:]] in its place. – William Apr 05 '13 at 23:01
  • GOT IT! I guess I should have mentioned I was using this command in Mutt in my mailcap file, hence I had to escape each `;` :) woops. Working now though. – jfreak53 Apr 05 '13 at 23:17

Here is a quick and dirty solution that assumes there will be no more than one newline in a link:

sed -i '' -e '/<a href=.*>/{/<\/a>/!{N;s|\n||;};}' -e 's,<a href="\(.*\)">\(.*\)</a>,\2 - \1,g'

The first command (/<a href=.*>/{/<\/a>/!{N;s|\n||;};}) checks for the presence of <a href=...> without </a>, in which case it reads the next line into the pattern space and removes the newline. The second is yours.
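A rough sketch of how this behaves, using made-up input and a hypothetical helper named `fix_links_one_break` that wraps the same two expressions. It reads standard input rather than editing a file in place, so it can be tried safely.

```shell
# Hypothetical wrapper around the two sed expressions from this answer.
fix_links_one_break() {
  sed -e '/<a href=.*>/{/<\/a>/!{N;s|\n||;};}' \
      -e 's,<a href="\(.*\)">\(.*\)</a>,\2 - \1,g'
}

# One line break inside the link: the lines are joined and the link is rewritten.
one_break=$(printf '%s\n' '<a href="link">Click' '        here now</a>' |
  fix_links_one_break)
printf '%s\n' "$one_break"

# Two line breaks inside the link: only the first is removed, so the
# closing tag is never on the same line and the link is left unconverted.
two_breaks=$(printf '%s\n' '<a href="link">Click' 'here' 'now</a>' |
  fix_links_one_break)
printf '%s\n' "$two_breaks"
```

The first case keeps the indentation of the second line as a run of spaces inside the link text, since nothing here collapses whitespace. The second case shows the stated limitation: the opening tag is still present in the output. Also note that `-i ''` is the BSD/macOS form of in-place editing; GNU sed takes `-i` with no separate argument.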

Beta