I'm having trouble removing JavaScript from HTML. I have put the contents of the HTML into a list, and I want to remove any text that is between the <script> and </script> tags. Note that a <script> (or </script>) tag can have any amount of whitespace or other text between the <script (or </script) portion and the final > character and still be a valid script tag that must be removed.
So far I have this, and it only seems to be removing the <script>. BTW, I want to do this without loading a package.
Thanks in advance.
def clean_JS(full_lists):
    indx = 0
    html = ''
    clean_lists = full_lists
    for i in range(len(clean_lists)):
        html_full = clean_lists[i][2]
        while True:
            idx1 = html_full.find('<script', indx)
            if idx1 == -1:
                break
            idx2 = html_full.find('>', idx1 + 1)
            if idx2 == -1:
                break
            idx3 = html_full.find('</script', idx2 + 1)
            if idx3 == -1:
                break
            idx4 = html_full.find('>', idx3 + 1)
            if idx4 == -1:
                break
            html += html_full[indx: idx1]
            indx = idx4 + 1
        html += html_full[indx:]
        clean_lists[i][2] = html
    return (html)
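
For comparison, here is a minimal sketch of the behaviour described above, using only str.find and no packages. It is one possible approach, not a definitive fix. The names strip_scripts and clean_js are hypothetical, and it assumes lowercase tag names (matching is case-sensitive, as in the code above) and that the HTML string sits at index 2 of each inner list. The main difference is that the position and the output are local to each string, so nothing carries over from one list item to the next.

```python
def strip_scripts(html):
    """Return html with every <script ...> ... </script ...> span removed."""
    pieces = []   # text kept between script blocks
    pos = 0       # where the next kept piece starts
    while True:
        start = html.find('<script', pos)
        if start == -1:
            break
        open_end = html.find('>', start)                  # end of opening tag
        if open_end == -1:
            break
        close_start = html.find('</script', open_end + 1)
        if close_start == -1:
            break                                         # unterminated: leave the rest as-is
        close_end = html.find('>', close_start)           # end of closing tag
        if close_end == -1:
            break
        pieces.append(html[pos:start])
        pos = close_end + 1
    pieces.append(html[pos:])                             # tail after the last script block
    return ''.join(pieces)


def clean_js(full_lists):
    """Clean the HTML at index 2 of every inner list, in place."""
    for row in full_lists:
        row[2] = strip_scripts(row[2])
    return full_lists
```

Called as clean_js(full_lists), this returns the same list with each HTML entry cleaned, rather than a single string.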