
I would like to remove the following tag from an HTML file, including its constantly varying contents:

<span class="the_class_name">li4tuq734g23r74r7Whatever</span>

The following Bash command

.... | sed -e :a -re 's/<span class="the_class_name"/>.*</span>//g' > "$NewFile"

ends with the error

sed: -e expression #2, char XX: unknown option to `s'

I tried escaping the quotes, slashes and "less than" symbols in various combinations, and I still get this error.

Paul
  • Ok, first problem gone. Time to work on the regex :-) You see the `/` in the first `span` tag in your regex? That doesn't seem to match your input. Perhaps `sed -E 's,<span class="the_class_name">[^<]*</span>,,g'` would be better – Ted Lyngmo Jul 05 '22 at 19:27
  • `sed '\#[^<]*#d'` – HatLess Jul 05 '22 at 19:29
  • @HatLess That'd delete the whole line if I'm not mistaken? – Ted Lyngmo Jul 05 '22 at 19:33
  • Please [Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858). I suggest using an XML/HTML parser (xmlstarlet, xmllint ...). – Cyrus Jul 05 '22 at 19:41
  • Does `the_class_name` occur only once in the XML file? – Cyrus Jul 05 '22 at 20:11
  • @Cyrus: Multiple times. All I want to do is to download a web page with wget, remove all the information that changes each time and if the new file differs from the previous one - send a message over Jabber. – Paul Jul 05 '22 at 20:19
  • I suggest to add your XML file to your question (no comment) with unimportant parts removed. It is important that the file is still a valid XML file. – Cyrus Jul 05 '22 at 20:27
  • It's HTML. Doesn't have to be a valid XML. And it's too long for being posted here. – Paul Jul 05 '22 at 20:53

1 Answer


I suggest using a separator other than / when the pattern itself contains /. Every unescaped / inside the expression is taken as a delimiter of the s command, which is what produces the "unknown option to `s'" error. Prefer -E over -r for extended regular expressions, since -E is the POSIX-compatible spelling. Note also that your regex has a / in the first span tag that doesn't belong there. Finally, .* is greedy: if the line contains more than one </span>, it will eat up everything through the last one. It's better to match on [^<]*, that is, any run of characters other than <.

sed -E 's,<span class="the_class_name">[^<]*</span>,,g'
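
To illustrate the difference between .* and [^<]*, here is a small sketch. The sample line and the variable names are made up for the illustration; they are not taken from your page.

```shell
# Hypothetical sample line containing two spans of the target class.
sample='<p>a<span class="the_class_name">li4tuq734</span>b<span class="the_class_name">x9</span>c</p>'

# Greedy .* runs from the first opening tag to the LAST </span> on the line,
# so the "b" between the two spans is deleted as well.
greedy=$(printf '%s\n' "$sample" | sed -E 's,<span class="the_class_name">.*</span>,,g')

# [^<]* stops at the next "<", so each span is removed on its own.
fixed=$(printf '%s\n' "$sample" | sed -E 's,<span class="the_class_name">[^<]*</span>,,g')

printf 'greedy: %s\n' "$greedy"
printf 'fixed:  %s\n' "$fixed"
```

With this input the greedy version prints `<p>ac</p>` while the [^<]* version prints `<p>abc</p>`. Note that [^<]* only works as long as the span contains no nested tags and sits on a single line.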

A better option is, of course, to use an HTML parser for this.

Ted Lyngmo
  • Ted, you seem to have a lot of experience with processing HTML on the command line. What command line HTML parser would you suggest? – Paul Jul 05 '22 at 19:44
  • @Paul I _had_ xp doing that some 20 years ago :-) I don't remember much. Cyrus suggested a few above. I recognize `xmllint` so that's been around for a long time. – Ted Lyngmo Jul 05 '22 at 19:45