I have the following HTML and I need to remove the script tags and any script related attributes in the HTML. By script related attributes I mean any attribute that starts with on.

<body>
<script src="...">

    </script>
<div onresize="CreateFixedHeaders()" onscroll="CreateFixedHeaders()" id="oReportDiv" style="overflow:auto;WIDTH:100%">

<script type="text/javascript" language="javascript">

//&lt;![CDATA[

function CreateFixedHeaders() {}//]]&gt;
</script>
<script>

            var ClientReportfb64a4706a3749c484169e...
        </script>
</body>

My first thought was to use BeautifulSoup to remove the tags and attributes. Unfortunately, I am unable to use BeautifulSoup. Seeing that BeautifulSoup is off the table I can see two options for doing this. The first option I see is splitting the strings and parsing based on index. This seems like a bad solution to me.

The other option is to use Regular Expressions. However, we know that isn't a good solution either (Cthulhu Parsing).

Now with that in mind, I personally feel it is alright to use regular expressions to strip the attributes. After all, with those it is still simple string manipulation.

So for removing the attributes I have:

script_attribute_regex = r'\son[a-zA-Z]+="[a-zA-Z0-9\.;\(\)_]+"'
result = re.sub(script_attribute_regex, "", page_source)

As I've said before, I personally think the above is a perfectly acceptable use of Regular Expressions with HTML. But I would still like to get some opinions on the above usage.

Then there is the question of the script tags. I'm very tempted to go with Regular Expressions for this because I know them and I know what I need is pretty simple. Something like:

<script(.*)</script>

The above would start to get me close to what I need. And yes I realize the above RegEx will grab everything starting at the first opening script tag until the last closing script tag, but it's a starting example.

I'm very tempted to use Regular Expressions as I'm familiar with them (more so than Python) and I know that is the quickest way to achieve the results I want, at least for me it is.

So I need help to go against my nature and not be evil. I want to be evil and use RegEx, so somebody please show me the light and guide me to the promised land of non-Regular Expressions.

Thanks

Update:

It looks like I wasn't very clear about what my question actually is; I apologize for that. My question is: how can I parse the HTML using pure Python without Regular Expressions?

<script(.*)</script>

As for the above code example, it's wrong. I know it is wrong, I was using it as an example of a starting point.

I hope this clears up my question somewhat.

Update 2

I just wanted to add a few more notes about what I am doing.

I am crawling a web site to get the data I need.

Once we have the page that contains the data we need it is saved to the database.

Then the saved web page is displayed to the user.

The issue I am trying to solve happens here. When you attempt to interact with the page, the application throws a script error that forces the user to click on a confirmation box. The application is not a web browser but uses the web browser DLL in Windows (I cannot remember the name at the moment).

The error in question only happens in this one page for this one web site.

Update 3

After adding the update I realized I was overthinking the problem: I was looking for a more generic solution, but in this case that isn't what is needed.

The page is dynamically generated, but the script tags stay static. With that in mind the solution becomes much simpler: I no longer need to treat the page as HTML and can treat the script tags as static strings.

So the solution I'm looking at is

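import re


def strip_script_tags(page_source: str) -> str:
    # Require whitespace before the attribute and leave the whitespace after
    # it alone, so neighbouring attributes stay separated and names that only
    # contain "on" (e.g. content="...") are not touched.
    pattern = re.compile(r'\son\w+="[^"]+"')
    result = re.sub(pattern, "", page_source)
    # Lazily match from each opening <script to the nearest closing /script>.
    pattern2 = re.compile(r'<script[\s\S]+?/script>')
    result = re.sub(pattern2, "", result)
    return result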

I would like to avoid Regular Expressions. However, since I'm limited to the standard library, regular expressions seem like the best solution in this case. Which means @skamazin's answer is correct.

Community
  • Try [this](http://stackoverflow.com/questions/6659351/removing-all-script-tags-from-html-with-js-regular-expression?rq=1) thread, I think it could help you. – Tomás Cot Aug 04 '14 at 14:45
  • @TomasCot Unfortunately that is JavaScript, I am attempting to do this in Python. I know BeautifulSoup has an API that is similar to JS and could use it that way. However, I cannot use BeautifulSoup, I need to use pure Python for this. If Python has the same style of DOM API as that, I could make it work but I'm not aware of anything other than BeautifulSoup that gives you a JS style DOM API. If I am incorrect, please correct me. Or were you speaking to the Regex that is extracted from jquery in that post? –  Aug 04 '14 at 14:52
  • @user3752226 It seems like you know what you're doing with the regex, so what's the question about? – skamazin Aug 04 '14 at 15:07
  • @user3752226, I was talking about the RegExp answer. – Tomás Cot Aug 04 '14 at 15:10
  • Why can't you use BeautifulSoup and what does "pure python" mean to you? BeautifulSoup is written in pure python, it has no C extension as far as I'm aware. Do you mean standard library only? – Jason S Aug 04 '14 at 17:21
  • @JasonS What I mean is I can only use the standard library –  Aug 04 '14 at 17:46
  • @JasonS That is an option. I can't believe I didn't think about that before. Thanks. –  Aug 04 '14 at 18:01
  • Actually, I have to revise my point, only the old py2-compatible BeautifulSoup is one file. – Jason S Aug 04 '14 at 18:08

1 Answer

As for removing all the attributes that start with on, you can try this regex:

\s?on\w+="[^"]+"\s?

Substitute each match with the empty string (deletion). In Python it should be:

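import re

# page_source is the HTML text to clean.
# Python 3 rejects the ur'' prefix, so a plain raw string is used.
pattern = re.compile(r'\s?on\w+="[^"]+"\s?')
subst = ""
result = re.sub(pattern, subst, page_source)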
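
One caveat with this pattern, shown here as a minimal sketch. Because the whitespace on both sides is optional and gets consumed, the tag name can end up glued to the next attribute, and attribute names that merely contain "on" (such as content) can be clipped. The `strict` pattern below is a hypothetical tighter variant, not part of the original answer, and the sample string is made up for illustration. Both patterns assume double-quoted attribute values.

```python
import re

html = '<div onresize="f()" onscroll="f()" id="oReportDiv"><meta content="x"></div>'

# Pattern from the answer: optional whitespace on both sides.
loose = re.compile(r'\s?on\w+="[^"]+"\s?')
# Hypothetical tighter variant: whitespace is required before the attribute
# name, and the whitespace after the value is left in place.
strict = re.compile(r'\son\w+="[^"]+"')

loose_result = loose.sub("", html)
strict_result = strict.sub("", html)

print(loose_result)   # <divid="oReportDiv"><meta c></div>
print(strict_result)  # <div id="oReportDiv"><meta content="x"></div>
```

Neither pattern handles single-quoted or unquoted attribute values, so this is only suitable when the markup is known in advance.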

If you are trying to match anything between the script tags try:

<script[\s\S]+?/script>

The problem with your regex is that the dot (.) doesn't match the newline character. Combining a character class with its complement, as in [\s\S], matches every possible character. And make sure to use the ? in [\s\S]+? so that it is lazy instead of greedy.
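
The difference can be seen in a small sketch. The sample string is made up for illustration. The sketch also shows re.DOTALL, a standard flag that makes the dot match newlines, as an alternative to [\s\S].

```python
import re

html = ("<p>a</p><script>\nvar x = 1;\n</script>"
        "<p>b</p><script>var y = 2;</script><p>c</p>")

# '.' stops at '\n', so the multi-line script block survives.
dot_lazy = re.sub(r'<script.*?/script>', "", html)

# [\s\S] matches newlines too; the lazy +? stops at the nearest /script>.
set_lazy = re.sub(r'<script[\s\S]+?/script>', "", html)

# Same effect with the dot once re.DOTALL is set.
dotall_lazy = re.sub(r'<script.*?/script>', "", html, flags=re.DOTALL)

# Greedy version: runs from the first <script to the LAST /script>,
# swallowing the <p>b</p> in between.
set_greedy = re.sub(r'<script[\s\S]+/script>', "", html)

print(repr(dot_lazy))    # '<p>a</p><script>\nvar x = 1;\n</script><p>b</p><p>c</p>'
print(repr(set_lazy))    # '<p>a</p><p>b</p><p>c</p>'
print(repr(set_greedy))  # '<p>a</p><p>c</p>'
```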

skamazin
  • Thank you for the RegEx tips. However, my question is not about using RegEx. My question is how to parse out the script tags without using Regular Expression. –  Aug 04 '14 at 16:18
  • OHHH! I don't know enough Python to help you there but I can point you in [this direction](https://docs.python.org/2/library/xml.etree.elementtree.html). I'm sorry if this doesn't help, but I've seen a lot of people suggest this type of approach for an issue with tags. – skamazin Aug 04 '14 at 16:21
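
For a standard-library route that avoids regular expressions, html.parser.HTMLParser is one option. The following is a minimal sketch, not a complete sanitizer: the names ScriptStripper and strip_scripts are hypothetical, the sample input is made up, and processing instructions are not handled. It uses html.parser rather than the ElementTree module linked above, since ElementTree expects well-formed XML and real-world HTML often is not.

```python
from html import escape
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit HTML, dropping <script> elements and on* attributes."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_script = False

    def _render(self, tag, attrs, close=">"):
        parts = [tag]
        for name, value in attrs:
            if name.startswith("on"):
                continue  # event-handler attribute: drop it
            # The parser hands over unescaped values, so escape them again.
            parts.append(name if value is None else f'{name}="{escape(value)}"')
        return "<" + " ".join(parts) + close

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        else:
            self.out.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        if tag != "script":
            self.out.append(self._render(tag, attrs, " />"))

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_script:  # skip the script body
            self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_scripts(page_source: str) -> str:
    parser = ScriptStripper()
    parser.feed(page_source)
    parser.close()
    return "".join(parser.out)


sample = ('<body><script src="x.js">\n</script>'
          '<div onresize="f()" id="oReportDiv">a &amp; b'
          '<script>var s = "<b>";</script></div></body>')
print(strip_scripts(sample))
# <body><div id="oReportDiv">a &amp; b</div></body>
```

Note that the parser lowercases tag and attribute names and the sketch re-quotes every attribute value, so the output is equivalent HTML but not byte-identical to the input.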