0

I know s/&amp;/\&/g replaces all escaped ampersands with ampersands. I want to be pickier: I only want to replace escaped ampersands that are inside an href. I can't figure it out.

I was trying the following but it wasn't working:

echo '<a href="http://example.com?q=man&amp;string=1&amp;bat=2">Link</a>' | sed -E 's/^href="(.*)&amp;/\1&/g'

I also see another problem: it would only replace the first instance of an escaped ampersand, not all of them. Does anyone know what the solution might be?

user983223
  • Do you have access to a language with an HTML parser? BTW, the ampersands in a URL that is inside an HTML attribute *should* be represented as `&amp;` or you risk interesting and unexpected behavior. – mu is too short Feb 24 '12 at 07:08
  • @muistooshort - Don't want a parser...just interested in this one case...I thought the urls should be &amp; but this one website only works if it is unescaped and there are a lot of links to it so it would be good to target it. – user983223 Feb 24 '12 at 07:26
  • The URL format in the HTML is not the same as the URL that gets sent to the remote server. The browser is supposed to apply an HTML decoding before sending the URL off. Perhaps you want to extract the `href` attributes and then HTML-decode the extracted attributes rather than replacing them in situ. – mu is too short Feb 24 '12 at 07:35
  • Is it possible that some hrefs already contain escaped '&' elements (e.g. &lt;)? In that case you should not escape them again, and it will be really difficult to solve with a regular expression – mac Feb 24 '12 at 08:03
  • You may not want a structured parser, but that doesn't mean you shouldn't be using one. You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Sam Greenhalgh Feb 24 '12 at 09:14

2 Answers

0

Not sure how to do it with sed, but here's Ruby:

echo '<a href="http://example.com?q=man&amp;string=1&amp;bat=2">Link</a>' | ruby -pe '$_.gsub!(/href="([^"]*)"/) { |h| h.gsub("&amp;", "&") }'

However, I fully support @muistooshort's comment: unless you're doing something weird, you should want the &amp; in there.
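
As for doing it in sed itself: a minimal sketch, assuming every href is double-quoted and sits on a single line, is to replace one `&amp;` per pass and loop until nothing matches any more. The function name `unescape_href_amps` is made up for illustration.

```shell
# Hypothetical helper: reads HTML on stdin and writes it to stdout with
# &amp; turned into & inside double-quoted href values only.
unescape_href_amps() {
  # :a  defines a label; the s command fixes one &amp; inside an href;
  # ta  jumps back to the label whenever a substitution was made.
  sed -E -e ':a' -e 's/(href="[^"]*)&amp;/\1\&/' -e 'ta'
}

echo '<a href="http://example.com?q=man&amp;string=1&amp;bat=2">Q &amp; A</a>' | unescape_href_amps
```

Because `[^"]*` cannot run past the closing quote, an `&amp;` in the link text is left alone. The `\&` in the replacement matters: a bare `&` there means "the whole match" in sed, which is one reason the attempt in the question could not work.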

Amadan
  • 191,408
  • 23
  • 240
  • 301
0
perl -e '$url=$ARGV[0]; while ( $url =~ s/(<a href="[^"]+?)&amp;/$1&/ ){};print "$url\n"' '<a href="http://example.com?q=man&amp;string=1&amp;bat=2">Link</a>'

Easily amended to run through a file