Python - Remove HTML-tag with regex

Question

This is usually not a hard task, but today I can't seem to remove a simple JavaScript tag.

The example I'm working with (formatted):

<section class="realestate oca"></section>
<script type="text/javascript" data-type="ad">
    window.addEventListener('DOMContentLoaded', function(){
        window.postscribe && postscribe(document.querySelector(".realestate"),
        '<script src="https://ocacache-front.schibsted.tech/public/dist/oca-loader/js/ocaloader.js?type=re&w=100%&h=300"><\/script>');
    });
</script>

The example I'm working with (raw):

<section class="realestate oca"></section>\n<script type="text/javascript" data-type="ad">\n\twindow.addEventListener(\'DOMContentLoaded\', function(){\n\t\twindow.postscribe && postscribe(document.querySelector(".realestate"),\n\t\t\'<script src="https://ocacache-front.schibsted.tech/public/dist/oca-loader/js/ocaloader.js?type=re&w=100%&h=300"><\\/script>\');\n\t});\n</script>

I would like to remove everything from <script (beginning of the second line) to </script> (last line), so that only the first line, <section..>, remains.

Here's my line of code:

import re

re.sub(r'<script[^</script>]+</script>', '', text)
# or
re.sub(r'<script.+?</script>', '', text)

I'm clearly missing something, but I can't see what.
Note: The document I'm working with contains mainly plain text so no parsing with lxml or similar is needed.

You should know that `[^</script>]` doesn't mean "anything except a closing script tag". – revo, Feb 13 '17 at 14:13
@glibdud I agree, I was only trying to flag it. http://meta.stackoverflow.com/q/343643/1561176 – Inbar Rose, Feb 13 '17 at 14:21
I think you should take a look at this answer about using regex to parse "html": http://stackoverflow.com/a/1732454/1561176 . Instead, you should be using a proper parser, such as BeautifulSoup. https://www.crummy.com/software/BeautifulSoup/ – Inbar Rose, Feb 13 '17 at 14:26
@revo Well, if I knew, I wouldn't be asking. Either way, I read somewhere that it meant "anything except this", and I use it a lot like this: `<[^>]+>`. – theusual, Feb 13 '17 at 14:26
@InbarRose That made an impression I won't forget. I don't think my document can be parsed; I think it's a better fit to manually index the tags, group them, and then delete everything in between. – theusual, Feb 13 '17 at 14:36

glibdud · Accepted Answer · 2017-02-13T14:38:57.140

3

Your first regex didn't work because a character class ([...]) is a set of individual characters, not a string. [^</script>] therefore matches any single character other than <, /, s, c, r, i, p, t or >, so the pattern only matches if <script is separated from </script> by a run of characters that contains none of those.
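To make the character-class point concrete, here is a minimal sketch. The two sample strings are made up for illustration and are not from the question; they only show that the first pattern succeeds when the text between the tags happens to avoid every excluded character, and fails as soon as an ordinary attribute appears.

```python
import re

# The asker's first pattern: the class excludes the single characters
# <, /, s, c, r, i, p, t and >, not the string "</script>".
pattern = re.compile(r'<script[^</script>]+</script>')

# Hypothetical input whose body avoids every excluded character.
simple = '<script 42 </script>'

# Hypothetical input with a normal attribute; the 't' in 'type' is excluded.
realistic = '<script type="text/javascript">alert(1)</script>'

simple_matches = pattern.search(simple) is not None
realistic_matches = pattern.search(realistic) is not None

print(simple_matches, realistic_matches)
```

Note that a negated class does match newlines, so line breaks were not what stopped the first pattern; the excluded letters were.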

Your second regex is better. The only reason it doesn't work is that, by default, the . wildcard does not match newlines. To make it match them, add the DOTALL flag:

re.sub(r'<script.+?</script>', '', text, flags=re.DOTALL)
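As a rough check, the snippet below rebuilds the raw string from the question and runs the second pattern with and without the flag. The variable names (text, without_flag, with_flag) are only illustrative.

```python
import re

# The raw example from the question, rebuilt as a Python string.
text = (
    '<section class="realestate oca"></section>\n'
    '<script type="text/javascript" data-type="ad">\n'
    "\twindow.addEventListener('DOMContentLoaded', function(){\n"
    '\t\twindow.postscribe && postscribe(document.querySelector(".realestate"),\n'
    '\t\t\'<script src="https://ocacache-front.schibsted.tech/public/dist/'
    'oca-loader/js/ocaloader.js?type=re&w=100%&h=300"><\\/script>\');\n'
    '\t});\n'
    '</script>'
)

# Without DOTALL, '.' stops at every newline, so nothing is removed.
without_flag = re.sub(r'<script.+?</script>', '', text)

# With DOTALL, '.' also matches newlines and the whole block is removed.
with_flag = re.sub(r'<script.+?</script>', '', text, flags=re.DOTALL)

print(repr(with_flag))
```

This works on the sample because the inner closing tag is written as <\/script>, which the pattern does not treat as </script>. If a script body contained a literal </script>, the lazy match would stop there and leave the remainder behind, so this is a sketch for input like the one shown rather than a general HTML solution.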

edited Feb 13 '17 at 14:38

answered Feb 13 '17 at 14:28


Amazing. Thanks for giving an explanation to why it didn't work! – theusual Feb 13 '17 at 14:58
