Removing elements by name from an HTML document from the shell - Sed command fails

Question

I'm trying to remove the embedded CSS from an HTML file on a Linux server (Red Hat 6.8). For example, the file 1.htm looks like this:

abc
<style type="text/css">
whatever
1
2
3
</style>
def

And what I need is

abc
def

I tried the sed command below

sed -i 's#<style type="text\/css">(.|\n)*<\/style>##g' 1.htm

but it's not working. Could someone shed some light on this? Thanks~

Another very good tool for this is [**html2text**](http://www.mbayer.de/html2text/). It will handle all html/markdown/css etc.. — David C. Rankin, Jul 11 '17 at 03:02
Thanks for the suggestion. html2text is not available on the server for the moment. will try to upload one. — Minitab, Jul 11 '17 at 04:03
@DavidC.Rankin: I eventually realized that this question is about _manipulating HTML_ (removing elements from an HTML document), which is why `html2text` (for _extracting plain-text data_) is not the right tool. Aside from that, while `html2text` produces nicely formatted output for _display_, using it to extract data for _programmatic_ processing is problematic. — mklement0, Jul 13 '17 at 01:10
@mklement0 I see your point. Based on the content of the question it appears to be asking for a removal that would leave nothing but the plain text (as if you had run `html2text filename`), but in reality if the posted content is a subset of a larger html file, then other wanted elements would be removed as well. — David C. Rankin, Jul 13 '17 at 07:19

mklement0 · Answer 1 · 2017-07-11T17:48:36.410

In order to match across lines, you must instruct sed to read the whole file at once.

With GNU sed (Linux) v4.2.2+, the simplest way to do that is to use -z (whose purpose is to read NUL-separated records; in the absence of embedded NULs, the entire file is read).

Also, because you use ( and ) unescaped as metacharacters, you must activate support for extended regular expressions via the -r option. You don't strictly need the group, though: (.|\n)* (which is equivalent to .*) must be replaced with [^<]* in order to match multiple <style> elements individually. Because sed regexes are greedy, .* would match everything up to the last </style> tag in the file, which removes too much when there are multiple elements.

sed -z -r -i 's#<style type="text/css">[^<]*</style>\n?##g' 1.htm

Note that I've appended \n? to the regex to ensure that no empty line is left behind by the replacement.

Use of unescaped ? also requires -r.

Since you've chosen # as the s delimiter, you needn't \-escape / chars. in the regex.
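
To illustrate the difference between the greedy .* and [^<]*, here is a small sketch that runs both variants on a made-up input with two <style> elements. It assumes GNU sed with -z support; the variable names and the sample text are illustrative only and not part of the question.

```shell
# Illustrative input with two <style> elements and text between them.
input='abc
<style type="text/css">
a
</style>
def
<style type="text/css">
b
</style>
ghi
'

# Greedy .* runs from the first opening tag to the LAST closing tag,
# so the "def" line between the two elements is removed as well.
greedy=$(printf '%s' "$input" | sed -z -r 's#<style type="text/css">.*</style>\n?##g')

# [^<]* cannot cross a "<", so each element is matched and removed on its own.
per_element=$(printf '%s' "$input" | sed -z -r 's#<style type="text/css">[^<]*</style>\n?##g')

printf 'greedy:\n%s\n\nper element:\n%s\n' "$greedy" "$per_element"
```

Note that [^<]* only works as long as the CSS itself contains no < character.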

With older GNU sed versions, you can use a loop (:a;$!{N;ba}) to read the entire file at once:

sed -r -i ':a;$!{N;ba}; s#<style type="text/css">[^<]*</style>\n?##g' 1.htm
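
For readers unfamiliar with the loop idiom, the following sketch spreads the same script over several lines with comments. It reads from a pipe instead of editing a file in place, and the sample input is made up for illustration.

```shell
# Illustrative input shaped like the file in the question.
input='abc
<style type="text/css">
whatever
1
</style>
def'

result=$(printf '%s\n' "$input" | sed -r '
# Define label "a".
:a
# On every line but the last: append the next input line to the
# pattern space (N), then jump back to label "a" (ba).
$!{
N
ba
}
# Reached only on the last line: the pattern space now holds the whole input.
s#<style type="text/css">[^<]*</style>\n?##g
')

printf '%s\n' "$result"
```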

Generally, for a more robust solution, use an HTML/XML-aware tool such as xsltproc (see below).

Robust solution using XSLT via `xsltproc`:

xsltproc is a third-party utility that comes with macOS and some Linux distributions (e.g., Fedora), and can easily be installed on others (e.g., on Ubuntu, with sudo apt-get install xsltproc).

With the --html option, it is capable of applying XSLT-based transformations to HTML documents too, not just to XML documents.

Here's a sample bash-based solution that demonstrates creating a copy of an HTML document with all <style> elements removed, gratefully adapted from this answer:

# Create a simple sample HTML document with 2 <style> elements at different
# levels of the DOM and save it as "file.html"
cat > file.html <<'EOF'
<html>
<head></head>
<body>
  <style type="text/css">
    * {
      border: 1 solid black;
    }
  </style>
  <p foo='bar'>
    abc def
    <style type="text/css">
      * {
        border: 2 dashed blue;
      }
    </style>
  </p>
</body>
</html>
EOF

xsltproc can then apply an XSLT template to the HTML file. Normally, such a template is provided as a file as well, but given its brevity, I'm constructing it in memory and passing it like a file via a bash process substitution (<(...)):

# Define the XSLT template that copies all nodes in the document except those
# named "style".
# For an explanation, see https://stackoverflow.com/a/322079/45375
template='<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="node()|@*">
  <xsl:copy>
    <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
</xsl:template>

<xsl:template match="style"/>

</xsl:stylesheet>'

# Invoke xsltproc with the template and the input file.
# --html tells xsltproc to process the file as HTML, both on input and on output.
xsltproc --html <(echo "$template") file.html

The above yields (note how both <style> elements were removed):

<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head>
<body>

  <p foo="bar">
    abc def

  </p>
</body>
</html>

To replace the input file with the modified copy (to emulate sed -i), use something like:

xsltproc --html <(echo "$template") file.html > /tmp/file.$$ && mv /tmp/file.$$ file.html
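
A variation on the same write-then-rename pattern, sketched here with mktemp instead of a fixed /tmp name. The helper rewrite_in_place and the file name remove-style.xsl are hypothetical names introduced for this example; the helper runs any filter command that takes the input file as its last argument.

```shell
# Hypothetical helper: run a filter command on a file and replace the file
# with the command's output, but only if the command succeeds.
rewrite_in_place() {
  local file=$1
  shift
  local tmp
  # Create the temporary file next to the target so that mv stays on the
  # same filesystem.
  tmp=$(mktemp "${file}.XXXXXX") || return 1
  if "$@" "$file" > "$tmp"; then
    mv "$tmp" "$file"
  else
    rm -f "$tmp"
    return 1
  fi
}

# Possible use with xsltproc, assuming the template was saved to a file
# named remove-style.xsl (an assumed name):
#   rewrite_in_place file.html xsltproc --html remove-style.xsl
```

Keep in mind that mktemp creates the temporary file with restrictive permissions, so the replaced file may end up with different permissions than the original.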

Abhinandan prasad · Answer 2 · 2017-07-16T19:10:13.450

sed usually processes input line by line (one line at a time). In this case, we have to read all lines at once. If you are not familiar with reading all lines at once, refer to the linked explanation.
```
sed -e ':a' -e 'N' -e '$!ba' -e 's/<style.*<\/style>//' file  | awk 'NF'
```
It will return:
```
abc
def
```

Suppose you have a file like the one below:

abc
<style type="text/css">
whatever
1
2
3
</style>
xyz
<style type="text/css">
whatever
1
2
3
</style>
def
<style type="text/css">
whatever
1
2
3
</style>
mno

And you want to print all text outside the tags, then:

 sed '/<style.*>/,/<\/style>/d' file

It will return:

abc
xyz
def
mno
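
The range-based command depends on the opening and closing tags being on separate lines. A small sketch of what happens otherwise, using a made-up input where both tags share a line; the second command is the whole-file approach from the other answer and assumes GNU sed with -z:

```shell
# Illustrative input: opening and closing tag are on the same line.
input='abc
<style type="text/css">x</style>
def
ghi'

# The range starts on line 2, but sed looks for the end pattern only on
# LATER lines. None matches, so everything through the end of input is deleted.
range_result=$(printf '%s\n' "$input" | sed '/<style.*>/,/<\/style>/d')

# Reading the whole input as one record and substituting removes only the element.
whole_file_result=$(printf '%s\n' "$input" | sed -z -r 's#<style type="text/css">[^<]*</style>\n?##g')

printf 'range:\n%s\n\nwhole file:\n%s\n' "$range_result" "$whole_file_result"
```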

edited Jul 16 '17 at 19:10

answered Jul 11 '17 at 05:18


Super cool! this is exactly what I need. It works perfectly. Thanks! – Minitab Jul 11 '17 at 05:37
@Abhinandanprasad: Thanks for updating (the `\n*` is still unnecessary). I suggest you make it a little clearer that your 1st command handles only a _single_ `style` element, and, on a minor point, that your 2nd command assumes that the opening and closing tags occur only by themselves on a line (are not preceded or followed by text to preserve and do not both occur on a single line). – mklement0 Jul 16 '17 at 18:35
@mklement0 I updated. For 2nd command, It's not working if tags are in single line. Any Suggestion how to fix it? – Abhinandan prasad Jul 16 '17 at 19:12
@Abhinandanprasad: Thanks; as for how to fix the problem of both the start and end tag appearing on the same line (and text preceding/following the tags): see my answer :) – mklement0 Jul 16 '17 at 19:21
@mklement0 Thanks :) – Abhinandan prasad Jul 17 '17 at 06:15
