I am building a scraper using Node.js and Puppeteer. Puppeteer gets the main content of a page and saves it as a string, Rss Parser converts it to an RSS feed, and the resulting XML is saved to disk as a file containing the scraped content. The problem is that if the scraped content contains script elements, such as AdSense code, those are scraped too. I need a simple regex that will remove every script element, along with all of its attributes and all content between its opening and closing tags.
I have been looking for a simple example that will allow me to do something like:
var content = scrapedcontent;
content = content.replace(myregex, '');
I cannot find an example that works for me. So far the closest things I've found suggest using jQuery. I cannot use jQuery because this is a Node.js project that does not include the jQuery library and I do not want to add jQuery just to strip scripts out of strings.
Also, please do not respond with lectures about what regexes and their characters mean. That is all lorem ipsum to me. I just need something that says "this is the regex, this is what it does, copy and paste it and you will be done."
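For reference, here is a minimal sketch of the kind of copy-and-paste snippet described above. It assumes the scraped content is an ordinary HTML string; the names `SCRIPT_TAG_REGEX` and `stripScripts` and the sample markup are made up for illustration. The pattern removes each `<script ...>` opening tag, its attributes, everything up to the matching `</script>`, and the closing tag itself, regardless of case or line breaks. It is not a full HTML parser, so unusual markup (for example a literal `</script>` inside a JavaScript string, or a `>` inside an attribute value) may not be handled correctly.

```javascript
// Hypothetical names; a sketch, not a complete HTML sanitizer.
// <script\b[^>]*>  matches the opening tag and any attributes
// [\s\S]*?         matches everything in between, including newlines (non-greedy)
// <\/script\s*>    matches the closing tag
// g = all occurrences, i = case-insensitive
const SCRIPT_TAG_REGEX = /<script\b[^>]*>[\s\S]*?<\/script\s*>/gi;

function stripScripts(html) {
  return html.replace(SCRIPT_TAG_REGEX, '');
}

// Sample input standing in for the string Puppeteer returned.
const scrapedcontent =
  '<p>Hello</p><script async src="https://example.com/ads.js"></script><p>World</p>';

const content = stripScripts(scrapedcontent);
console.log(content); // <p>Hello</p><p>World</p>
```

To use it in the form shown in the question, `content = content.replace(SCRIPT_TAG_REGEX, '');` does the same thing without the helper function.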