sed to remove content between two patterns

Question

Possible Duplicate:
Extract data from HTML table with BASH script

I have an HTML file that contains the following content. I want to use sed to remove all the content (possibly spanning multiple lines) between the patterns < script ..... > and </script> and leave the rest as it is. I also want to remove the tags.

Any help would be appreciated. Thanks! I tried both of the following, but with no luck.

cat test.html | tr -d '\n' | sed 's/< script.*<\/script>//g' > output.txt

and

sed '/< script/,/<\/script>/d' test.html > output.txt

don't touch this.

this is not to be removed < script bla bla> this is to be

removed. < /script> this is going to

stay < script bla bla bla bla bla> remove this

and this 

and this < /script> and this stays as is.

this too.

Apparently the second most popular question on Stack Overflow: "how to remove .. sed .. between two patterns?" :) http://stackoverflow.com/search?q=sed+patterns – Piotr Wadas, Sep 25 '12 at 16:39

score 0 · Answer 1 · answered Sep 25 '12 at 16:44

What about:

cat yourfile | tr -d '\n' | sed -e 's,< script.*< /script>,,g'

Note the space in the closing tag.

Stephane Rouberol

Useless use of `cat` (`tr -d '\n' yourfile`). And you are using a greedy regex, so it can delete something you might want to leave untouched. And see this answer: http://stackoverflow.com/a/1732454/11621 – Zsolt Botykai, Sep 25 '12 at 16:58

`cat` (or `tr < yourfile`) seems necessary with certain versions of tr, like tr (GNU coreutils) 8.9 – Stephane Rouberol, Sep 25 '12 at 18:11
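The greedy-match concern raised in the comments can be seen with a small sketch. The sample file, the variable names, and the \001 marker byte below are all assumptions made for illustration, not part of the original answer. The second pipeline is one possible workaround: it replaces each closing tag with a single marker byte so that a negated bracket expression stops at the first closing tag. It assumes the marker byte never occurs in the input and that every opening tag has a matching closing tag.

```shell
#!/bin/sh
# Hypothetical sample input, loosely modelled on the question's example.
printf '%s\n' \
  'keep A < script x> drop 1 < /script> keep B' \
  '< script y> drop 2' \
  '< /script> keep C' > sample.html

# tr reads only standard input, so redirect the file instead of piping cat.
# Greedy: .* spans from the first opening tag to the LAST closing tag,
# so "keep B" is lost along with the script content.
greedy=$(tr -d '\n' < sample.html | sed 's,< script.*< /script>,,g')

# Workaround sketch: turn each closing tag into a single marker byte, then
# let a negated bracket expression stop at the first marker.
# Assumption: the byte \001 never occurs in the input.
m=$(printf '\001')
bounded=$(tr -d '\n' < sample.html |
  sed "s,< /script>,$m,g; s,< script[^$m]*$m,,g")

printf 'greedy:  %s\n' "$greedy"
printf 'bounded: %s\n' "$bounded"
```

With this input, the greedy version drops "keep B" while the bounded version keeps it. Both versions still join all lines into one, because `tr -d '\n'` removes every newline.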

score 0 · Answer 2 · answered Sep 26 '12 at 07:13

This might work for you (GNU sed):

sed ':a;$!{N;ba};/\x00/q1;s/<\s*\/\?script[^>]*>/\x00/g;s/\x00[^\x00]*\x00//g' file

There is a small chance it will fail: if the HTML file already contains the byte \x00, sed exits with return code 1 and prints the file unchanged.

Explanation:

`:a;$!{N;ba}`: slurp the file into the pattern space
`/\x00/q1`: check the file for the byte \x00 and, if found, quit with return code 1
`s/<\s*\/\?script[^>]*>/\x00/g`: replace all script start and end tags with \x00
`s/\x00[^\x00]*\x00//g`: remove everything between the \x00's
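The steps above can be tried on a condensed version of the question's sample. This is only a sketch: it assumes GNU sed (the `\s`, `\?`, `\x00` and `q1` constructs are GNU extensions), and the file names and the `strip_scripts` wrapper function are hypothetical names introduced here.

```shell
#!/bin/sh
# Requires GNU sed; \s, \?, \x00 and q1 are GNU extensions.
# Hypothetical sample input, condensed from the question's example.
printf '%s\n' \
  'this is not to be removed < script bla bla> this is to be' \
  'removed. < /script> this is going to' \
  'stay < script bla> remove this' \
  'and this < /script> and this stays as is.' > sample.html

# Hypothetical wrapper around the answer's one-liner.
strip_scripts() {
  sed ':a;$!{N;ba};/\x00/q1;s/<\s*\/\?script[^>]*>/\x00/g;s/\x00[^\x00]*\x00//g' "$1"
}

# Normal case: script blocks and their tags are removed, newlines outside
# the removed spans are preserved.
result=$(strip_scripts sample.html)
status=$?
printf '%s\n' "$result"

# Failure case: an input that already contains a NUL byte makes sed quit
# with status 1 before any substitution happens.
printf 'a\000b\n' > nul.html
strip_scripts nul.html > /dev/null
nul_status=$?
printf 'status with NUL in input: %s\n' "$nul_status"
```

Unlike the `tr -d '\n'` approach, this keeps the line structure of the text that remains, since the newlines are only removed where they fall between a pair of markers.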
