Remove specific tag with its contents using sed

Question

I would like to remove the following tag from an HTML file, including its constantly varying contents:

<span class="the_class_name">li4tuq734g23r74r7Whatever</span>

The following Bash command

.... | sed -e :a -re 's/<span class="the_class_name"/>.*</span>//g' > "$NewFile"

ends with the error

sed: -e expression #2, char XX: unknown option to `s'

I tried to escape quotes, slashes and "less than" symbols in various combinations and still get this error.

Ok, first problem gone. Time to work on the regex :-) You see the `/` in the first `span` tag in your regex? That doesn't seem to match your input. Perhaps `sed -E 's,<span class="the_class_name">[^<]*</span>,,g'` would be better — Ted Lyngmo, Jul 05 '22 at 19:27
Please [Don't Parse XML/HTML With Regex](https://stackoverflow.com/a/1732454/3776858). I suggest using an XML/HTML parser (xmlstarlet, xmllint ...). — Cyrus, Jul 05 '22 at 19:41
@Cyrus: Multiple times. All I want to do is to download a web page with wget, remove all the information that changes each time and if the new file differs from the previous one - send a message over Jabber. — Paul, Jul 05 '22 at 20:19
I suggest adding your XML file to your question (not in a comment) with the unimportant parts removed. It is important that the file is still a valid XML file. — Cyrus, Jul 05 '22 at 20:27
It's HTML. It doesn't have to be valid XML. And it's too long to post here. — Paul, Jul 05 '22 at 20:53

score 2 · Accepted Answer · answered Jul 05 '22 at 19:41

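I suggest using a sed separator other than / when / appears in the text you want to match. In your command, the unescaped / in /> and </span> ends the pattern and the replacement early, so sed reads the leftover text as flags, which is what produces the "unknown option to `s'" error. Also, prefer -E over -r for extended regex to be POSIX compatible. Note as well that the / in the first span tag of your regex (the /> part) doesn't belong there, since your input has no such slash. Finally, .* is greedy: it will eat everything up to the last </span> on the line, not just up to the first one. It's better to match on [^<]*, that is, any run of characters that are not <.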

sed -E 's,<span class="the_class_name">[^<]*</span>,,g'
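To make the greedy-match point concrete, here is a small sketch that runs both variants on one made-up line containing two spans of the target class. The sample input and the variable names (`line`, `greedy`, `safe`) are only illustrative and are not taken from the page in the question.

```shell
# Hypothetical sample line with two spans of the target class.
line='<p>a<span class="the_class_name">x1</span>b<span class="the_class_name">y2</span>c</p>'

# Greedy .* runs from the first opening tag to the LAST </span> on the line,
# so the "b" between the two spans is removed as well.
greedy=$(printf '%s\n' "$line" | sed -E 's,<span class="the_class_name">.*</span>,,g')

# [^<]* stops at the next "<", so each span is removed on its own.
safe=$(printf '%s\n' "$line" | sed -E 's,<span class="the_class_name">[^<]*</span>,,g')

printf 'greedy: %s\n' "$greedy"   # greedy: <p>ac</p>
printf 'safe:   %s\n' "$safe"     # safe:   <p>abc</p>
```

Keep in mind that [^<]* only matches spans whose contents contain no < at all, so a span with nested tags inside it would be left untouched.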

A better option is of course to use an HTML parser for this.

Ted Lyngmo


Ted, you seem to have a lot of experience with processing HTML on the command line. What command line HTML parser would you suggest? – Paul Jul 05 '22 at 19:44
@Paul I _had_ xp doing that some 20 years ago :-) I don't remember much. Cyrus suggested a few above. I recognize `xmllint` so that's been around for a long time. – Ted Lyngmo Jul 05 '22 at 19:45
