sed - Include newline in pattern

Question

I am still new to shell scripting but am trying hard. Below is a partially working shell script that is supposed to remove all JS from *.htm documents by matching script tags and deleting their enclosed content, e.g. <script src="">, <script></script> and <script type="text/javascript">.

find $1 -name "*.htm" > ./patterns
for p in $(cat ./patterns)
do
sed -e "s/<script.*[.>]//g" $p #> tmp.htm ; mv tmp.htm $p
done

The problem with this script is that, because sed reads its input line by line, it does not work as expected when a script element spans several lines. Running it on:

<script>
//Foo
</script>

will remove the opening script tag but leave the "//Foo" line and the closing tag in place, which I don't want.

Is there a way to match new-line characters in my regular expression? Or if sed is not appropriate, is there anything else I can use?

score 6 · Accepted Answer · answered Jul 16 '13 at 08:33

Assuming that you have <script> tags on different lines, e.g. something like:

foo
bar
<script type="text/javascript">
some JS
</script>
foo

the following should work:

sed '/<script/,/<\/script>/d' inputfile
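As a quick check, the command can be run against the sample above. This is only a minimal sketch: the temporary file and the `result` variable are illustrative names, not part of the original answer.

```shell
# Write the sample input from the answer to a temporary file.
inputfile=$(mktemp)
printf '%s\n' 'foo' 'bar' '<script type="text/javascript">' 'some JS' '</script>' 'foo' > "$inputfile"

# Delete every line from one matching /<script/ through the next one
# matching /<\/script>/ (both boundary lines included).
result=$(sed '/<script/,/<\/script>/d' "$inputfile")
printf '%s\n' "$result"

rm -f "$inputfile"
```

One caveat worth keeping in mind: sed only starts looking for the end of a range on the line after the one that opened it, so a one-line <script src=""></script> would open a range that runs on to the next line containing </script>, or to the end of the file.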

answered Jul 16 '13 at 08:33

devnull


Tested. It works. The only problem that I have with it is that characters on the same line are replaced along with the script tag. e.g. foo – GoofyBall Jul 16 '13 at 19:38
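One possible way around the issue raised in this comment is to stop working line by line and cut out only the text between the tags. The following is a rough sketch, not the answerer's method: `strip_scripts` is a hypothetical helper name, and it assumes lowercase tags, input small enough to hold in memory, and a closing </script> for every opening tag.

```shell
# Hypothetical helper: remove <script ...>...</script> spans, keeping any
# other text that shares a line with the tags.
strip_scripts() {
  awk '
    { buf = buf $0 "\n" }                      # collect the whole input
    END {
      out = ""
      while (match(buf, /<script/)) {
        out = out substr(buf, 1, RSTART - 1)   # keep text before the tag
        buf = substr(buf, RSTART)
        endpos = index(buf, "</script>")
        if (endpos == 0) {                     # unterminated tag: keep the rest as-is
          out = out buf
          buf = ""
          break
        }
        buf = substr(buf, endpos + length("</script>"))
      }
      printf "%s", out buf
    }' "$@"
}

printf 'foo<script>\n//Foo\n</script>bar\n' | strip_scripts
```

Here the text before the opening tag and after the closing tag survives, joined on one line, because the newlines inside the removed span go away with it.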

score 1 · Answer 2 · answered Jul 16 '13 at 08:29

This awk script looks for an opening <script*> tag, sets the skip flag and moves on to the next line. When the closing </script*> tag is found, the flag is reset to zero. The final rule prints a line only while the flag is zero.

awk '/<script.*>/   { skip=1; next }              # opening tag: start skipping
     /<\/script.*>/ { if (skip) skip=0; next }    # closing tag: stop skipping
     { if (!skip) print }                         # print lines outside script blocks
    ' "$1"
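To tie this back to the loop in the question, a line-based filter like this one can be applied to every *.htm file under a directory and written back in place. This is a sketch under assumptions: `strip_js_tree` is a made-up function name, the flag variable is called `skip`, and file names are assumed not to contain newlines.

```shell
# Hypothetical wrapper: filter every *.htm file below the directory in $1.
strip_js_tree() {
  find "$1" -type f -name '*.htm' | while IFS= read -r f; do
    tmp=$(mktemp) || break
    # Drop the lines from an opening script tag through its closing tag.
    awk '/<script.*>/   { skip=1; next }
         /<\/script.*>/ { if (skip) skip=0; next }
         { if (!skip) print }' "$f" > "$tmp" && cat "$tmp" > "$f"
    rm -f "$tmp"
  done
}
```

Writing the result back with cat rather than mv keeps the original file's permissions, and the file is only overwritten when awk exits successfully.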

score 0 · Answer 3 · edited May 23 '17 at 10:30

As you mentioned, the issue is that sed processes input line by line.

The simplest workaround is therefore to make the input a single line, e.g. replacing newlines with a character which you are confident doesn't exist in your input.

One would be tempted to use tr :

… |tr '\n' '_'|sed 's~<script>.*</script>~~g'|tr '_' '\n'

However "currently tr fully supports only single-byte characters", and to be safe you probably want to use some improbable character like ˇ, for which tr is of no help.
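If a single-byte placeholder is acceptable after all, a control character such as \001 is about as improbable in HTML as ˇ and still works with tr. A minimal sketch, assuming the input contains no \001 byte and that the local sed copes with one very long line (`strip_with_tr` is an illustrative name):

```shell
# Hypothetical helper: join all lines with \001, edit, then split again.
strip_with_tr() {
  tr '\n' '\001' | sed 's~<script>.*</script>~~g' | tr '\001' '\n'
}

printf 'a\n<script>\nx\n</script>\nb\n' | strip_with_tr
```

The newlines that surrounded the removed block stay in place, so an empty line is left where the script element used to be.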

Fortunately, the same thing can be achieved with sed, using branching.

Back to our <script>…</script> example, the following works and should be (according to the previous link) cross-platform :

… |sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ˇ/g' -e 's~<script>.*</script>~~g' -e 's/ˇ/\n/g'

Or in a more condensed form if you use GNU sed and don't need cross-platform compatibility :

… |sed ':a;N;$!ba;s/\n/ˇ/g;s~<script>.*</script>~~g;s/ˇ/\n/g'

Please refer to the linked answer under "using branching" for details about the branching part (:a;N;$!ba;). The remaining part is straightforward :

s/\n/ˇ/g replaces all newlines with ˇ ;
s~<script>.*</script>~~g removes what needs to be removed. Beware that it needs hardening for actual use : as is, it deletes everything between the first <script> and the last </script>. Also note that I used ~ instead of / as the delimiter to avoid escaping the slash in </script> ; just about any single-byte character would do, except a few reserved ones like \ ;
s/ˇ/\n/g restores the newlines (POSIX leaves the meaning of \n in the replacement text unspecified, so this step depends on the sed implementation supporting it).
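The greedy match mentioned above can be tamed with a common sed idiom: replace each closing tag with a single marker byte first, then match "anything except the marker" up to the marker. This is a sketch rather than the answerer's own command. `strip_each_script` and the \002 marker are assumptions, the input must not contain that byte, and the sed in use must let a negated bracket expression match newlines in the pattern space (GNU sed does).

```shell
# Hypothetical helper: remove each script element separately, not first-to-last.
strip_each_script() {
  stx=$(printf '\002')                      # marker byte, assumed absent from the input
  sed -e ':a' -e 'N' -e '$!ba' \
      -e "s~</script>~${stx}~g" \
      -e "s~<script[^>]*>[^${stx}]*${stx}~~g" \
      -e "s~${stx}~</script>~g" "$@"
}

printf 'a<script>\nx\n</script>b\nc<script src="y"></script>d\n' | strip_each_script
```

The last expression restores any closing tag that had no matching opening tag. As with the original command, a one-line input never reaches the substitutions, because N has no following line to read.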

Note that if you need to perform operations which do not depend on the branching, it may be wiser to pipe `sed` output to a new `sed` instance (I myself encountered issues with some operations working within the same instance, others not). — Skippy le Grand Gourou, Mar 28 '17 at 10:11
