
I have multiple HTML documents, and each one has many occurrences of

<a name="pIDsomestring"> 

where 'somestring' varies with each occurrence.

I want to delete the entire tag, as well as the

</a> 

closing HTML tag that immediately follows it, but importantly, not the text inside the anchor tag.

Is there an easy way to do this with sed?

dmohr
  • [don't parse html with regular expressions](http://stackoverflow.com/a/1732454/7552) – glenn jackman Jan 12 '16 at 00:11
  • OK, I get that (see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). However, even the second most popular response there suggests that with simple HTML, it is still sometimes a workable approach. – dmohr Jan 12 '16 at 00:32

1 Answer


HTML is much more complicated than what can be parsed with sed. Two pieces of HTML can be absolutely equivalent, and yet look completely different as far as a sed command is concerned. For example, you can't really write a sed command that will recognize that these two are equivalent:

<a name="foo">bar</a>

<A
    NAME = "foo"
    ><!-- </A> --bar</>-- -->

(The </>, if you're wondering, means </a> in this case. And heh, even Stack Overflow's syntax highlighter gets confused by the <!-- comment -- not-a-comment -- comment --> notation.)

The above is a pathological example, of course, but even perfectly-ordinary real-world HTML often has line-breaks and other whitespace in random places that have no effect on the HTML but a great deal of effect on a sed command.

But if you're just doing a one-off task where you can manually verify the results afterward, you can try something like this:

sed 's#<a name="[^"]*">\(\([^<]\|<[^/]\|</[^a]\|</a[^>]\)*\)</a>#\1#g'

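which will usually work as long as the whole thing is on one line, since sed processes its input one line at a time. Note that the \| alternation is a GNU sed extension.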
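
For the specific case in the question, here is a minimal sketch of the same substitution restricted to anchors whose name starts with pID. The function name strip_pid_anchors is made up for illustration, and the sketch assumes each anchor (opening tag, text, closing tag) sits on one line and is written exactly as `<a name="pID...">`. It uses `-E` (extended regular expressions), so the alternation is written `|` instead of `\|`.

```shell
# Hypothetical helper: remove <a name="pID..."> and its closing </a>,
# keeping the text in between. Other anchors are left untouched.
# Assumes the whole anchor is on a single line.
strip_pid_anchors() {
    sed -E 's#<a name="pID[^"]*">(([^<]|<[^/]|</[^a]|</a[^>])*)</a>#\1#g' "$@"
}

# Reads stdin when no file names are given.
printf '%s\n' '<p><a name="pID42">Intro</a> and <a href="x.html">a link</a></p>' \
    | strip_pid_anchors
# prints: <p>Intro and <a href="x.html">a link</a></p>
```

To try it on a real document, pass a file name and write the result somewhere else first, e.g. `strip_pid_anchors page.html > page.stripped.html` (file names are placeholders), then diff the two before replacing anything.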

ruakh