This usually isn't a hard task, but today I can't seem to remove a simple JavaScript tag.
The example I'm working with (formatted):
<section class="realestate oca"></section>
<script type="text/javascript" data-type="ad">
window.addEventListener('DOMContentLoaded', function(){
window.postscribe && postscribe(document.querySelector(".realestate"),
'<script src="https://ocacache-front.schibsted.tech/public/dist/oca-loader/js/ocaloader.js?type=re&w=100%&h=300"><\/script>');
});
</script>
The example I'm working with (raw):
<section class="realestate oca"></section>\n<script type="text/javascript" data-type="ad">\n\twindow.addEventListener(\'DOMContentLoaded\', function(){\n\t\twindow.postscribe && postscribe(document.querySelector(".realestate"),\n\t\t\'<script src="https://ocacache-front.schibsted.tech/public/dist/oca-loader/js/ocaloader.js?type=re&w=100%&h=300"><\\/script>\');\n\t});\n</script>
I would like to remove everything from <script (beginning of the second line) to </script> (last line), so that only the first line, <section..>, remains.
Here's my line of code:
re.sub(r'<script[^</script>]+</script>', '', text)
# or
re.sub(r'<script.+?</script>', '', text)
I'm clearly missing something, but I can't see what.
Note: The document I'm working with contains mainly plain text, so no parsing with lxml or similar is needed.
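For reference, here is a minimal sketch of what may be going wrong, assuming the text contains real newlines and tabs as the raw version suggests. By default `.` does not match a newline, so `<script.+?</script>` cannot span several lines unless `re.DOTALL` is passed. Separately, `[^</script>]` is a character class that excludes each of the characters `<`, `/`, `s`, `c`, `r`, `i`, `p`, `t`, `>` individually, not the literal string `</script>`. The sample `text` below is reconstructed from the raw example, and the helper name `strip_script` is only illustrative.

```python
import re

# Sample input reconstructed from the raw example (assumed to contain
# real newlines and tabs, not the two-character sequences "\n" and "\t").
text = (
    '<section class="realestate oca"></section>\n'
    '<script type="text/javascript" data-type="ad">\n'
    "\twindow.addEventListener('DOMContentLoaded', function(){\n"
    '\t\twindow.postscribe && postscribe(document.querySelector(".realestate"),\n'
    '\t\t\'<script src="https://ocacache-front.schibsted.tech/public/dist/'
    'oca-loader/js/ocaloader.js?type=re&w=100%&h=300"><\\/script>\');\n'
    '\t});\n'
    '</script>'
)


def strip_script(s):
    """Remove each <script ...>...</script> block (hypothetical helper).

    re.DOTALL lets '.' match newlines, so the lazy '.+?' can run across
    lines until the first literal '</script>'.
    """
    return re.sub(r'<script.+?</script>', '', s, flags=re.DOTALL)


# Both original attempts leave the text untouched on this input.
unchanged_lazy = re.sub(r'<script.+?</script>', '', text)
unchanged_class = re.sub(r'<script[^</script>]+</script>', '', text)

# re.sub returns a new string; the result has to be kept.
cleaned = strip_script(text)
print(repr(cleaned))
```

This works on the sample because the inner closing tag is written as `<\/script>`, so the first literal `</script>` is the real one. A script whose body contains an unescaped `</script>` inside a string would be cut short by this pattern.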