Possible Duplicate:
Extract data from HTML table with BASH script

I have an HTML file with the content shown at the end of this question. I want to use sed to remove everything (possibly spanning several lines) between the patterns < script ..... > and </script> and leave the rest as it is. The tags themselves should be removed too.

Any help would be appreciated, thanks! I tried both of the following, but with no luck:

cat test.html | tr -d '\n' | sed 's/< script.*<\/script>//g' > output.txt

and

sed '/< script/,/<\/script>/d' test.html > output.txt    

don't touch this.

this is not to be removed < script bla bla> this is to be

removed. < /script> this is going to

stay < script bla bla bla bla bla> remove this

and this 

and this < /script> and this stays as is.

this too.

The Coder
  • Could both tags appear in the same line? – Birei Sep 25 '12 at 16:38
  • Apparently the second most popular question on Stack Overflow: "how to remove .. sed .. between two patterns?" :) http://stackoverflow.com/search?q=sed+patterns – Piotr Wadas Sep 25 '12 at 16:39

2 Answers


What about:

cat yourfile | tr -d '\n' | sed -e 's,< script.*< /script>,,g'

Note the space in the closing tag (< /script>), as in your sample input.
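
One caveat, offered as a hedged aside: `.*` is greedy, so with several script blocks this removes everything from the first opening tag to the last closing tag, and `tr -d '\n'` also joins all lines into one. Below is a minimal sketch that pairs each opening tag with the nearest closing tag and keeps the line breaks. It uses awk rather than sed, which is a swapped-in technique and not part of the command above. The function name `strip_script_blocks_awk` is made up for illustration, and the tag patterns assume the `< script ...>` / `< /script>` spelling from the question.

```shell
# Hypothetical helper: remove every "< script ...> ... < /script>" block,
# pairing each opening tag with the nearest closing tag (non-greedy).
strip_script_blocks_awk() {
  awk '
    # Collect the whole input in one string, keeping the newlines.
    { text = text $0 "\n" }
    END {
      out = ""
      # Find the next opening tag; spaces after "<" are optional.
      while (match(text, /< *script[^>]*>/)) {
        # Keep what precedes the opening tag, then skip past the tag.
        out = out substr(text, 1, RSTART - 1)
        text = substr(text, RSTART + RLENGTH)
        # Drop everything up to and including the nearest closing tag.
        if (match(text, /< *\/ *script *>/))
          text = substr(text, RSTART + RLENGTH)
        else
          text = ""   # unterminated block: drop the rest of the input
      }
      printf "%s", out text
    }
  ' "$@"
}

# Small demonstration with two script blocks, one spanning two lines.
printf '%s\n' \
  'keep1 < script a> drop' \
  'drop < /script> keep2 < script b> drop < /script> keep3' |
  strip_script_blocks_awk
```

The function reads standard input or the files given as arguments, for example `strip_script_blocks_awk test.html > output.txt`. Matching is case-sensitive and purely textual, so it is no substitute for a real HTML parser.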

Stephane Rouberol
  • Useless use of `cat` (`tr -d '\n' yourfile`). And you are using a greedy regex, so it can delete something you might want to leave untouched. And see this answer: http://stackoverflow.com/a/1732454/11621 – Zsolt Botykai Sep 25 '12 at 16:58
  • `cat` (or `tr < yourfile`) seems necessary with certain versions of tr, like tr (GNU coreutils) 8.9 – Stephane Rouberol Sep 25 '12 at 18:11

This might work for you (GNU sed):

sed ':a;$!{N;ba};/\x00/q1;s/<\s*\/\?script[^>]*>/\x00/g;s/\x00[^\x00]*\x00//g' file

There is a slight chance it will fail: if the HTML file already contains the byte \x00, sed exits with a return code of 1 and prints the file unchanged.

Explanation:

  • :a;$!{N;ba} slurp the file into the pattern space
  • /\x00/q1 check the file for hexcode \x00 and if found quit with return code of 1
  • s/<\s*\/\?script[^>]*>/\x00/g replace all script start and end tags with \x00
  • s/\x00[^\x00]*\x00//g remove everything between \x00's
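
The four steps above can be tried on a small input. A minimal sketch, assuming GNU sed (the `\x00` escape, `\s`, `\?` and `q1` are GNU extensions). The wrapper name `strip_script_blocks_sed` is hypothetical; the command inside is the one from this answer, reading standard input instead of `file`.

```shell
# Hypothetical wrapper around the sed command from this answer (GNU sed assumed).
strip_script_blocks_sed() {
  # 1. slurp all input into the pattern space
  # 2. quit with status 1 if the input already contains a NUL byte
  # 3. replace every opening or closing script tag with a NUL marker
  # 4. delete each marker pair together with everything between them
  sed ':a;$!{N;ba};/\x00/q1;s/<\s*\/\?script[^>]*>/\x00/g;s/\x00[^\x00]*\x00//g'
}

# Small demonstration: one script block spanning two lines.
printf '%s\n' \
  'keep this' \
  'before < script bla> drop this' \
  'and this < /script> after' \
  'keep this too' |
  strip_script_blocks_sed
```

The text on either side of a removed block ends up on the same line, because the newlines inside the block are deleted along with it. Checking `$?` after the call distinguishes the NUL-byte case (1) from success (0).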
potong