I'm having an issue removing JavaScript from HTML. I have put the contents of the HTML into a list, and I want to remove any text that is between the <script> and </script> tags. Note that a <script> (or </script>) tag can have any amount of whitespace or other text between the <script (or </script) portion and the final > character and still be a valid script tag that must be removed.

So far I have the code below, and it only seems to be removing the <script>. I would also like to do this without loading a package.

Thanks in advance.

def clean_JS(full_lists):
    indx = 0
    html = ''
    clean_lists = full_lists
    for i in range(len(clean_lists)):
        html_full = clean_lists[i][2]
        while True:
            idx1 = html_full.find('<script', indx)
            if idx1 == -1:
                break
            idx2 = html_full.find('>', idx1 + 1)
            if idx2 == -1:
                break
            idx3 = html_full.find('</script', idx2 + 1)
            if idx3 == -1:
                break
            idx4 = html_full.find('>', idx3 + 1)
            if idx4 == -1:
                break
            html += html_full[indx: idx1]
            indx = idx4 + 1
        html += html_full[indx:]
        clean_lists[i][2] = html

    return (html)
EvensF
scout112
  • Could you provide a minimal reproducible example? Could you also provide an example of the HTML file you are trying to parse? Is there a reason why you don't want to use an existing module? Writing this yourself will take time that would be better spent using a robust existing library. For example, take a look at [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) – EvensF Oct 28 '20 at 23:39

1 Answer

I would recommend importing the regular expression operations module (re) to find the text between the substrings <script> and </script>, as answered here. Note that re is part of Python's standard library, so nothing extra has to be installed.
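As a rough sketch of the re approach (the function name strip_scripts_re and the exact pattern are my own choices for illustration, not something from the question), the pattern below allows any text between <script or </script and the closing >, ignores case, and lets the script body span several lines:

```python
import re

# "<script", then anything up to ">", then the shortest possible body,
# then "</script", then anything up to ">".
SCRIPT_RE = re.compile(
    r'<script\b[^>]*>.*?</script\b[^>]*>',
    re.IGNORECASE | re.DOTALL,
)


def strip_scripts_re(html):
    """Return html with every matched script block removed."""
    return SCRIPT_RE.sub('', html)
```

This is not a full HTML parser. For instance, a > inside an attribute value or a literal "</script" inside a JavaScript string would confuse it.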

However, if you do not want to load a package, you need to do a few workarounds, as in this example:

# Text before the first '<script>'; if there is none, this is the whole string
before_script, _, rest = html_full.partition('<script>')
# Text after the first '</script>' that follows it
_, _, after_script = rest.partition('</script>')
# Join what comes before the script with what comes after it
clean_html = before_script + after_script

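This works assuming you only have one script in your HTML file and that the tags are written exactly as <script> and </script>, with no attributes or extra whitespace. If there are several scripts, you have to remove them one pair at a time, each time keeping the text before the next <script> and continuing after its matching </script>.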
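To handle several scripts and tags with attributes without importing anything, one possible sketch is below. It follows the same find-based idea as the code in the question, but resets the position and the output for every document and returns the cleaned list rather than a single string. The names find_tag, strip_scripts and clean_js are made up for this example, and I am assuming (from the question's code) that the HTML sits at index 2 of each row.

```python
def find_tag(html, name, pos):
    """Index of the next '<' + name at or after pos, ignoring case; -1 if none."""
    target = '<' + name
    while True:
        pos = html.find('<', pos)
        if pos == -1:
            return -1
        if html[pos:pos + len(target)].lower() == target:
            return pos
        pos += 1


def strip_scripts(html):
    """Return html with every <script ...>...</script ...> block removed."""
    pieces = []
    pos = 0
    while True:
        start = find_tag(html, 'script', pos)
        if start == -1:
            break
        open_end = html.find('>', start)            # end of the opening tag
        if open_end == -1:
            break
        close_start = find_tag(html, '/script', open_end + 1)
        if close_start == -1:
            break
        close_end = html.find('>', close_start)     # end of the closing tag
        if close_end == -1:
            break
        pieces.append(html[pos:start])              # keep text before the script
        pos = close_end + 1                         # continue after the script
    pieces.append(html[pos:])                       # keep whatever is left
    return ''.join(pieces)


def clean_js(full_lists):
    """Return a copy of full_lists with scripts stripped from index 2 of each row."""
    cleaned = []
    for row in full_lists:
        new_row = list(row)                         # do not modify the caller's rows
        new_row[2] = strip_scripts(row[2])
        cleaned.append(new_row)
    return cleaned
```

Like the question's code, this treats anything starting with <script as a script tag and stops at the first > it sees, so it is a sketch for reasonably well-formed pages rather than a real parser.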