5

How to remove text between <script> and </script> using Python?

RichieHindle

9 Answers

27

You can do this with BeautifulSoup (among other methods):

from bs4 import BeautifulSoup

soup = BeautifulSoup(source.lower())  # note: .lower() also lower-cases the page's text, not just the tags
to_extract = soup.findAll('script')
for item in to_extract:
    item.extract()

This actually removes the nodes from the HTML. If you want to leave empty <script></script> tags in place, you'll have to work with the item's contents rather than just extracting it from the soup.
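For instance, a minimal sketch of that variant with the bs4 API (clear() empties a tag's contents while keeping the tag itself; source is assumed to hold your HTML string):

from bs4 import BeautifulSoup

soup = BeautifulSoup(source)
for item in soup.findAll('script'):
    item.clear()  # leaves an empty <script></script> behind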

tgray
  • This is the right answer. Niloy, or anyone reading this question, please ignore any of the answers advocating using regular expressions in this case as they all have _serious_, easily exploitable security problems. – Tamas Czinege Jun 08 '09 at 12:43
  • I agree with @DrJokepu. Do not try to parse HTML with regular expressions! – gerdemb Jun 08 '09 at 16:40
  • I can't get this to work because the text between the script tag contains things like: var str=" – JeremyKun Mar 28 '12 at 22:06
  • This is two years old but I will try and comment. @DrJokepu this would be a good idea but I can't load the html into BeautifulSoup because the javascript has bad html tags in it that throw an error in the parser. I need to use RegEx to strip the javascript first. – Reily Bourne Jun 26 '12 at 22:32
  • Does the source need to be valid HTML? – earthmeLon Jul 16 '14 at 15:30
  • @earthmeLon Nope, that's what makes BeautifulSoup so useful. It does its best to handle errors in the "tag soup" that you give it and present you with a valid DOM model. You can even plug in different HTML parsers which have their own levels of leniency. You can check out the parser comparison [here](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser). – tgray Jul 29 '14 at 02:37
  • from bs4 import BeautifulSoup – Jeremy Leipzig Jun 26 '15 at 17:26
6

Are you trying to prevent XSS? Just eliminating the <script> tags will not stop all possible attacks! Here's a great list of the many ways (some of them very creative) that you could be vulnerable: http://ha.ckers.org/xss.html. After reading that page you should understand why just eliminating the <script> tags with a regular expression is not robust enough. The Python library lxml has a function that will robustly clean your HTML to make it safe to display.
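As a rough sketch of that cleaning function (the Cleaner class from lxml.html.clean; in recent lxml releases it lives in the separate lxml_html_clean package, and html_string here is a placeholder for your markup):

from lxml.html.clean import Cleaner

# Strips <script> elements, javascript: URLs, on* event-handler attributes, and more.
cleaner = Cleaner(scripts=True, javascript=True)
safe_html = cleaner.clean_html(html_string)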

If you are sure that you just want to eliminate the <script> tags, this lxml code should work:

from lxml.html import parse

root = parse(filename_or_url).getroot()
for element in root.iter("script"):
    # drop_tree() removes the element and its children; any tail text after the tag is preserved
    element.drop_tree()
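If you then need the cleaned markup back as a string, lxml's tostring serializer should do it (a small sketch; output formatting may differ slightly from the input):

from lxml.html import tostring

cleaned = tostring(root, encoding='unicode')  # encoding='unicode' returns str instead of bytes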

Note: I downvoted all the solutions using regular expressions. See here why you shouldn't parse HTML using regular expressions: Using regular expressions to parse HTML: why not?

Note 2: Another SO question showing HTML that is impossible to parse with regular expressions: Can you provide some examples of why it is hard to parse XML and HTML with a regex?

gerdemb
2

Following up on the answers posted by Pev and wr, why not upgrade the regular expression, e.g.:

import re

pattern = r"(?is)<script[^>]*>(.*?)</script>"
text = """<script>foo bar
baz bar foo  </script>"""
re.sub(pattern, '', text)

(?is) is added to ignore case and let the dot match newlines. This version should also support script tags with attributes.

EDIT: I can't add any comments yet, so I'm just editing my answer. I totally agree with the comment below: regexes are the wrong tool for tasks like this, and BeautifulSoup or lxml are much better. But the question gave just a simple example, and a regex should be enough for such a simple task. Using Beautiful Soup for simple text removal may just be overkill.

BTW I made a mistake, the code should look like this:

import re

pattern = r"(?is)(<script[^>]*>)(.*?)(</script>)"
text = """<script>foo bar
baz bar foo  </script>"""
re.sub(pattern, r'\1\3', text)  # raw string, otherwise \1 and \3 are read as octal escapes
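Running the corrected version should leave just the empty tags behind, e.g.:

print(re.sub(pattern, r'\1\3', text))
# <script></script>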
uolot
0

You can do this with the HTMLParser module (complicated) or use regular expressions:

import re

content = "asdf <script> bla </script> end"
x = re.search("<script>.*?</script>", content, re.DOTALL)
span = x.span()  # gives (5, 27)

stripped_content = content[:span[0]] + content[span[1]:]

EDIT: re.DOTALL, thanks to tgray
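For comparison, the same removal can be done in one step with re.sub; a small sketch using the pattern and flag from above:

import re

content = "asdf <script> bla </script> end"
stripped_content = re.sub("<script>.*?</script>", "", content, flags=re.DOTALL)
# stripped_content == 'asdf  end'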

wr.
  • This has lots of potential issues regarding things like case, whether the script tag has attributes, possibly escaped pieces of text, etc. It's pretty hard to cover all the options reliably making it much easier to use existing, tested, libraries such as Beautiful Soup. – mavnn Jun 08 '09 at 11:50
  • You may want to add the re.DOTALL / re.S flag to your search so the 'dot' character matches newlines. Without this, you won't match script blocks that span multiple lines (which are most of them). – tgray Jun 08 '09 at 11:51
  • Unfortunate that a legitimate answer gets downvoted; this does meet the necessary specs for sure, doesn't it? – lprsd Jun 08 '09 at 16:43
  • @becomingGuru See the two links in my solution for why parsing HTML with regular expressions is a bad idea. While this answer might meet the "specs" of the question, it has serious security problems and is not really a reliable solution. See the notes from 'mavnn' and – gerdemb Jun 08 '09 at 17:07
  • (hit submit too soon) and also 'DrJokepu' pointing out the same problems. – gerdemb Jun 08 '09 at 17:07
  • Personally, I don't think that blindly following the specs is the right approach in this case. Generally, there is no guarantee that the example mentioned in the question will be the only way you will have script tags in your input. Actually, I think that it would be such a special and rare case that the question asker would surely have mentioned it in the question. – Tamas Czinege Jun 08 '09 at 19:51
0

If you're removing everything between <script> and </script> why not just remove the entire node?

Are you expecting a resig-style src and body?

annakata
0

If you don't want to import any modules:

string = "<script> this is some js. begone! </script>"

# Build a new list rather than deleting items while iterating, which can skip elements.
words = [w for w in string.split(' ') if w not in ('<script>', '</script>')]

print(' '.join(words))
sqram
0

ElementTree is the simplest and sweetest package to do this. Yes, there are other ways to do it too, but don't use any 'coz they suck! (via Mark Pilgrim)
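A minimal sketch of that approach with the standard-library ElementTree (assuming html_string holds well-formed XHTML; ElementTree will not cope with tag soup):

import xml.etree.ElementTree as ET

root = ET.fromstring(html_string)

# Elements carry no parent pointer, so walk every element and
# remove any <script> children it contains.
for parent in root.iter():
    for script in parent.findall('script'):
        parent.remove(script)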

lprsd
-1

I don't know Python well enough to give you a solution. But if you want to use this to sanitize user input, you have to be very, very careful. Removing stuff between <script> and </script> just doesn't catch everything. Maybe you can have a look at existing solutions (I assume Django includes something like this).

ujh
-1
import re

example_text = "This is some text <script> blah blah blah </script> this is some more text."

myre = re.compile("(^.*)<script>(.*)</script>(.*$)")
result = myre.match(example_text)

result.groups()
# ('This is some text ', ' blah blah blah ', ' this is some more text.')

# Text between <script> .. </script>
result.group(2)
# ' blah blah blah '

# Text outside of <script> .. </script>
result.group(1) + result.group(3)
# 'This is some text  this is some more text.'
Simon Peverett