I have the following HTML and I need to remove the script tags and any script-related attributes in it. By script-related attributes I mean any attribute whose name starts with on.
<body>
<script src="...">
</script>
<div onresize="CreateFixedHeaders()" onscroll="CreateFixedHeaders()" id="oReportDiv" style="overflow:auto;WIDTH:100%">
<script type="text/javascript" language="javascript">
//<![CDATA[
function CreateFixedHeaders() {}//]]>
</script>
<script>
var ClientReportfb64a4706a3749c484169e...
</script>
</body>
My first thought was to use BeautifulSoup to remove the tags and attributes. Unfortunately, I am unable to use BeautifulSoup. With that off the table, I can see two options. The first is splitting the strings and parsing based on index, which seems like a bad solution to me.
The other option is to use Regular Expressions. However, we know that isn't a good solution either (Cthulhu Parsing).
Now with that in mind, I personally feel it is alright to use regular expressions to strip the attributes. After all, with those it is still simple string manipulation.
So for removing the attributes I have:
script_attribute_regex = r'\son[a-zA-Z]+="[a-zA-Z0-9\.;\(\)_]+"'
result = re.sub(script_attribute_regex, "", page_source)
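As a quick check of what this pattern does and does not catch, here is a minimal sketch run against the div from the sample above. The digit range is written as 0-9, and the helper name strip_on_attributes is made up for illustration.

```python
import re

# Pattern from the question, with the digit range written as 0-9.
SCRIPT_ATTRIBUTE_REGEX = r'\son[a-zA-Z]+="[a-zA-Z0-9\.;\(\)_]+"'


def strip_on_attributes(page_source: str) -> str:
    """Remove on* attributes whose values fit the restricted character class."""
    return re.sub(SCRIPT_ATTRIBUTE_REGEX, "", page_source)


sample = ('<div onresize="CreateFixedHeaders()" onscroll="CreateFixedHeaders()" '
          'id="oReportDiv" style="overflow:auto;WIDTH:100%">')
print(strip_on_attributes(sample))

# A handler value containing a comma or a space falls outside the character
# class, so the attribute is left in place.
print(strip_on_attributes('<a onclick="go(1, 2)">'))
```

The second call shows the main limitation: the value character class only covers simple handlers like the ones on this page, so anything with spaces, commas, or quotes slips through.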
As I've said before, I personally think the above is a perfectly acceptable use of Regular Expressions with HTML. Still, I would like to get some opinions on it.
Then there is the question of the script tags. I'm very tempted to go with Regular Expressions for this because I know them and I know what I need is pretty simple. Something like:
<script(.*)</script>
The above would start to get me close to what I need. And yes I realize the above RegEx will grab everything starting at the first opening script tag until the last closing script tag, but it's a starting example.
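The greedy behaviour described above is easy to see on a small made-up string. This is only a sketch: the html and multiline values are invented for the demonstration. It also shows that . does not match newlines unless re.DOTALL is passed, which matters for script blocks that span several lines.

```python
import re

html = '<p>a</p><script>x()</script><p>keep me</p><script>y()</script><p>b</p>'

# Greedy: runs from the first <script to the last </script>,
# swallowing the paragraph in between.
greedy = re.sub(r'<script(.*)</script>', '', html)

# Non-greedy: stops at the nearest </script>, so each block is removed separately.
lazy = re.sub(r'<script(.*?)</script>', '', html)

print(greedy)
print(lazy)

# Without re.DOTALL, "." does not cross line breaks, so a multi-line
# script block is not matched at all.
multiline = '<script>\nx()\n</script><p>ok</p>'
no_flag = re.sub(r'<script(.*?)</script>', '', multiline)
with_flag = re.sub(r'<script(.*?)</script>', '', multiline, flags=re.DOTALL)
print(no_flag == multiline)
print(with_flag)
```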
I'm very tempted to use Regular Expressions as I'm familiar with them (more so than Python) and I know that is the quickest way to achieve the results I want, at least for me it is.
So I need help to go against my nature and not be evil. I want to be evil and use RegEx, so somebody please show me the light and guide me to the promised land of non-Regular Expressions.
Thanks
Update:
It looks like I wasn't very clear about what my question actually is, and I apologize for that. My question is: how can I parse the HTML using pure Python without Regular Expressions?
<script(.*)</script>
As for the above code example, it's wrong. I know it is wrong; I was only using it as an example of a starting point.
I hope this clears up my question somewhat.
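For reference, a minimal sketch of what a regex-free approach could look like using html.parser from the standard library. The names ScriptStripper and strip_scripts are made up for illustration. The output is normalized (lowercased tag and attribute names, re-quoted attribute values) rather than byte-identical to the input, and processing instructions and other rare constructs are not handled.

```python
from html import escape
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emits the HTML it is fed, minus <script> elements and on* attributes."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as they were written.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_script = False

    def _render_tag(self, tag, attrs, closing=">"):
        kept = []
        for name, value in attrs:
            if name.startswith("on"):
                continue  # drop event-handler attributes
            if value is None:
                kept.append(f" {name}")  # valueless attribute, e.g. "disabled"
            else:
                # The parser unescapes attribute values, so escape them again.
                kept.append(f' {name}="{escape(value, quote=True)}"')
        return f"<{tag}{''.join(kept)}{closing}"

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        else:
            self.out.append(self._render_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        if tag != "script":
            self.out.append(self._render_tag(tag, attrs, closing=" />"))

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_script:  # script bodies arrive here as plain data
            self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_scripts(page_source: str) -> str:
    parser = ScriptStripper()
    parser.feed(page_source)
    parser.close()  # flush anything still buffered
    return "".join(parser.out)


example = ('<body><script src="x.js"></script>'
           '<div onresize="f()" id="d" style="overflow:auto">Hi &amp; bye'
           '<script>var a = "<b>";</script></div></body>')
print(strip_scripts(example))
```

The parser treats the contents of a script element as raw text, so markup-looking strings inside a script do not confuse it, which is the main thing a regex has trouble guaranteeing.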
Update 2
I just wanted to add a few more notes about what I am doing.
I am crawling a web site to get the data I need.
Once we have the page that contains the data we need it is saved to the database.
Then the saved web page is displayed to the user.
The issue I am trying to solve happens here. When you attempt to interact with the page, the application throws a script error, which forces the user to click on a confirmation box. The application is not a web browser but uses the web browser DLL in Windows (I cannot remember the name at the moment).
The error in question only happens in this one page for this one web site.
Update 3
After adding the update I realized I was overthinking the problem. I was looking for a more generic solution, but in this case that isn't what is needed.
The page is dynamically generated; however, the script tags will stay static. With that in mind the solution becomes much simpler: I no longer need to treat the page as HTML, only as static strings.
So the solution I'm looking at is:
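import re

def strip_script_tags(page_source: str) -> str:
    # Remove on* attributes. Only the leading whitespace is consumed, so the
    # attributes that remain stay separated from the tag name and each other.
    pattern = re.compile(r'\s+on\w+="[^"]+"')
    result = re.sub(pattern, "", page_source)
    # Remove each script element, non-greedily and across line breaks.
    pattern2 = re.compile(r'<script[\s\S]+?/script>')
    result = re.sub(pattern2, "", result)
    return result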
I would like to avoid Regular Expressions. However, since I'm limited to the standard library, regular expressions seem like the best solution in this case, which means @skamazin's answer is correct.