Use sed to delete all occurrences of

Question

I have multiple HTML documents and each one has many occurrences of

<a name="pIDsomestring">

where 'somestring' varies with each occurrence.

I want to delete the entire tag, as well as the

</a>

closing HTML tag that immediately follows it, but importantly, not the text inside the anchor tag.

Is there an easy way to do this with sed?

[don't parse html with regular expressions](http://stackoverflow.com/a/1732454/7552) — glenn jackman, Jan 12 '16 at 00:11
Ok, I get that, reference http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags however even the second most popular response suggests that with simple HTML, it is still sometimes a workable approach. — dmohr, Jan 12 '16 at 00:32

score 1 · Answer 1 · answered Jan 12 '16 at 00:27

HTML is much more complicated than what can be parsed with sed. Two pieces of HTML can be absolutely equivalent, and yet look completely different as far as a sed command is concerned. For example, you can't really write a sed command that will recognize that these two are equivalent:

<a name="foo">bar</a>

<A
    NAME = "foo"
    ><!-- </A> --bar</>-- -->

(The </>, if you're wondering, means </a> in this case. And heh, even Stack Overflow's syntax highlighter gets confused by the </> notation.)

The above is a pathological example, of course, but even perfectly ordinary real-world HTML often has line breaks and other whitespace in random places that have no effect on the HTML but a great deal of effect on a sed command.

But if you're just doing a one-off task where you can manually verify the results afterward, you can try something like this:

sed 's#<a name="[^"]*">\(\([^<]\|<[^/]\|</[^a]\|</a[^>]\)*\)</a>#\1#g'

which will usually work as long as each anchor, from its opening tag through its closing </a>, sits on a single line.
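
As a hedged sketch of how the command above might be tried out before touching real files: the function name `strip_pid_anchors` and the sample lines below are made up for illustration. The pattern is narrowed to names starting with `pID`, as in the question, and rewritten for `sed -E` (extended regular expressions), on the assumption that your sed supports `-E`; the `\|` alternation used in the basic-regex form is a GNU extension rather than POSIX.

```shell
#!/bin/sh
# Hypothetical helper: unwrap <a name="pID...">text</a> into just text.
# Reads the files given as arguments, or standard input if there are none.
strip_pid_anchors() {
    sed -E 's#<a name="pID[^"]*">(([^<]|<[^/]|</[^a]|</a[^>])*)</a>#\1#g' "$@"
}

# Anchor on a single line: the pID anchor is unwrapped, the href link is kept.
printf '%s\n' '<p><a name="pID123">Hello</a> world <a href="x">link</a></p>' |
    strip_pid_anchors
# prints: <p>Hello world <a href="x">link</a></p>

# Opening tag split across two lines: sed works one line at a time,
# so neither line matches and both are printed unchanged.
printf '%s\n' '<a' '    name="pID456">Hello</a>' | strip_pid_anchors
```

For real documents, one cautious approach is to write the output to a new file and diff it against the original before replacing anything.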
