
I want to delete all <script> elements from all HTML files in all subfolders, but I can't find the correct form of the command.

regular expression: <script[\w\W]*?</script>

Here's how it looks in my command line:

find . -type f -name «*.html» -exec sed -i 's/<script[\w\W]*?</script>//g' {} \;

I also tried escaping every special character, down to: \<script\[\\w\\W\]\*\?\<\/script\>

This doesn't work either.

There is another option:

find -type f -name \*.html | xargs sed -i '/\<script/,/\<\/script\>/c\ '

but it deletes everything on the page from the first script to the last. I only need to delete each <script ...>...</script> block.

Maybe grep can do it?

Allan
Canapsis
  • Are all your `<script>` elements located on the same line? I would recommend using XSLT on your HTML files though; that's the cleanest way. – Allan Apr 01 '19 at 08:58
  • Regular expressions are catastrophically ill-suited to parsing structured formats. This is basically an amplification of the preceding comment. – tripleee Apr 01 '19 at 09:08
  • Are you really using guillemets `«*.html»` to quote shell meta characters? That will never work. You MUST use ASCII double quotes, ASCII single quotes, or backslashes. – Jens Apr 01 '19 at 09:35

3 Answers


Using regex to parse HTML or XML files is generally considered bad practice (see here and here). Tools such as sed and awk are extremely powerful for handling text files, but when it comes to parsing complex structured data such as XML, HTML or JSON, they are nothing more than a sledgehammer. Yes, you can get the job done, but sometimes at a tremendous cost. Such delicate files call for more finesse and a more targeted set of tools.
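To make that cost concrete, here is a small sketch of two ways a line-based sed substitution goes wrong on script tags. The two inputs are made up for illustration; the substitution is a simplified form of the one in the question, since sed's regular expressions have no lazy `*?` quantifier.

```shell
#!/bin/sh
# Two made-up inputs that show where a line-based sed substitution goes wrong.
multi=$(printf '%s\n' '<p>keep</p>' '<script>' 'var a = 1;' '</script>')
single='<script>a()</script><p>keep</p><script>b()</script>'

# 1. sed processes one line at a time, so a script that spans several
#    lines never matches and the input comes back unchanged.
out_multi=$(printf '%s\n' "$multi" | sed 's/<script.*<\/script>//g')

# 2. The greedy .* runs to the LAST </script> on the line, so the
#    paragraph between the two scripts is deleted as well.
out_single=$(printf '%s\n' "$single" | sed 's/<script.*<\/script>//g')

printf 'multi-line input:\n%s\n' "$out_multi"
printf 'single-line input: [%s]\n' "$out_single"
```

The first result still contains the whole script block, and the second is empty even though the `<p>keep</p>` paragraph should have survived.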

For parsing XML or HTML, one can easily use xmlstarlet:

xmlstarlet ed -d '//script'

However, as HTML pages are often not well-formed XML, it can be handy to clean them up first using tidy. For the example above, this gives:

$ tidy -q -numeric -asxhtml --show-warnings no file.html \
  | xmlstarlet ed -N "x=http://www.w3.org/1999/xhtml" \
               -d '//x:script'

where -N binds the prefix x to the XHTML namespace, if the document declares one. The namespace is recognisable by

<html xmlns="http://www.w3.org/1999/xhtml">

in the XHTML output of tidy.

kvantour

Example of file:

$ more input.html 
<!DOCTYPE html>
<html>
  <head>
    <title>Title of the document</title>
  </head>
  <body>
    <p id="example"></p>
    <script>
      document.getElementById("example").innerHTML = "My first JavaScript code";
    </script>
  </body>
</html>

Example of stylesheet:

$ more removescript.xsl 
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xhtml="http://www.w3.org/1999/xhtml">

    <xsl:output method="html" encoding="utf-8" indent="yes"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" />
        </xsl:copy>
    </xsl:template>

    <xsl:template match="//script" />

</xsl:stylesheet>

Command:

$ xsltproc --html removescript.xsl input.html 
<html>
  <head>
    <title>Title of the document</title>
  </head>
  <body>
    <p id="example"/>

  </body>
</html>

Explanation:

The stylesheet copies every node and attribute (the first template is an identity transform). When it reaches a <script> element, the empty second template matches instead and emits nothing, so those elements are absent from the result.
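Since the question asks about all HTML files in all subfolders, the single-file command can be wrapped in a loop. This is only a sketch: the function name `strip_scripts` and its two arguments are hypothetical, it assumes xsltproc is installed, and it assumes no file name contains a newline.

```shell
#!/bin/sh
# Hypothetical helper: apply a stylesheet in place to every .html file
# below a directory. Assumes xsltproc (from libxslt) is available.
strip_scripts() {
    xsl=$1     # path to the stylesheet, e.g. removescript.xsl
    root=$2    # directory to search recursively
    find "$root" -type f -name '*.html' | while IFS= read -r f; do
        tmp=$(mktemp) || continue
        # Transform into a temp file first, so a failed run never
        # truncates the original.
        if xsltproc --html -o "$tmp" "$xsl" "$f" && [ -s "$tmp" ]; then
            cat "$tmp" > "$f"    # overwrite contents, keep permissions
        else
            echo "skipped: $f" >&2
        fi
        rm -f "$tmp"
    done
}
```

It would be called as `strip_scripts removescript.xsl .` from the top of the tree. Running it on a copy of the files first is a sensible precaution, because the HTML serializer may reformat markup that it does not remove.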

Allan

I found a simple solution:

find . -type f -name "*.html" -exec perl -0 -i -pe 's/<script.*?script>//gs' {} \;

Canapsis