
I am building a scraper using Node.js and Puppeteer. Puppeteer gets the main content of a page and saves it as a string, Rss Parser converts it to an RSS feed, and the resulting XML is written to disk as a file containing the scraped content. The problem is that if the scraped content contains script elements, such as AdSense code, they are scraped too. I need a simple regex that will remove any script element along with all of its attributes and all content in between.

I have been looking for a simple example that will allow me to do something like:

var content = scrapedcontent;
content = content.replace(myregex, '');

I cannot find an example that works for me. So far the closest things I've found suggest using jQuery. I cannot use jQuery because this is a Node.js project that does not include the jQuery library and I do not want to add jQuery just to strip scripts out of strings.

Also, please do not respond with lectures about what regexes and their characters mean. That is all lorem ipsum to me. I just need to find something that says "this is the regex, this is what it does, copy and paste and you will be done."

Ronak Shah
  • Re: your last paragraph, *God forbid* we ask you to learn anything or do any of the work.... – Jared Smith Mar 20 '21 at 01:29
  • I don't see why you wouldn't want to use jQuery. It's perfect for things like this as you're manipulating the DOM, and your reasoning of _I don't want to add jQuery_ is amusing when it's a sensible solution to your problem. Using regex to do things with HTML is generally considered to be a [bad idea](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) anyways. I found something [that might be of use to you, though.](https://stackoverflow.com/questions/45262311/remove-specific-html-tag-with-its-content-from-javascript-string) – Voltaire Mar 20 '21 at 02:04
  • You should probably know that ` – Ouroborus Mar 20 '21 at 02:12
  • Anything that could be written using jQuery can also be written without jQuery. What's the link to the jQuery example? – Ouroborus Mar 20 '21 at 02:13
  • this seems to work for opening tags and what followed except the closing one / – PostAlmostAnything Mar 20 '21 at 02:45
  • Ouroborus - how can you add a script without a script element? – PostAlmostAnything Mar 20 '21 at 02:45
  • This seems to be working .replace(/ – PostAlmostAnything Mar 20 '21 at 04:13
  • I also added something to remove all instances of the string javascript: because that appears to be a way to execute javascript without a script tag according to https://owasp.org/www-community/xss-filter-evasion-cheatsheet – PostAlmostAnything Mar 20 '21 at 04:14
  • There is no simple regex that will remove all possible Javascript. HTML and scripts embedded in HTML tags are not a perfect match for a regex. – jfriend00 Mar 20 '21 at 04:28
  • jfriend00 - I know that now, but fortunately I also use WPRobot to parse the RSS feed results and that seems pretty good at combating malicious scripts. At least I have never seen the WPRobot feature that gets the main content of a URL ever do so in a way that results in scripts embedded in that content running as far as I know. I have seen images, links, and sometimes inputs make it through, so I am trying to remove all script tags, style tags, and input tags before submitting the content to WPRobot. – PostAlmostAnything Mar 20 '21 at 23:10
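
As a minimal sketch of the regex approach discussed in the question and comments above, and not a vetted sanitizer: the hypothetical helper `stripElements` below removes script, style, and input elements from a string using plain `String.prototype.replace`. The function name and the sample markup are assumptions made for illustration. The sketch assumes reasonably well-formed markup, where no attribute value contains `>` and every script or style element has a closing tag, and it leaves inline event handlers and `javascript:` URLs untouched.

```javascript
// Best-effort sketch: strips <script>, <style>, and <input> elements from an HTML string.
// Hypothetical helper name; assumes no attribute value contains ">" and that
// every <script>/<style> element is closed. It is NOT a full XSS sanitizer.
function stripElements(html) {
  return html
    // <script ...> ... </script>: opening tag with any attributes, plus everything up to the closing tag
    .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '')
    // <style ...> ... </style>: same pattern for style elements
    .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, '')
    // <input ...> is a void element, so only the single tag is matched
    .replace(/<input\b[^>]*>/gi, '');
}

// Sample markup (assumed for illustration), standing in for the scraped string.
const scrapedcontent =
  '<p>Hello</p><script async src="https://example.com/ads.js"></script>' +
  '<script>(adsbygoogle = window.adsbygoogle || []).push({});</script><p>World</p>';

const content = stripElements(scrapedcontent);
console.log(content); // prints "<p>Hello</p><p>World</p>"
```

The lazy `[\s\S]*?` stops at the first closing tag, so several script elements in one string are removed one at a time rather than as a single greedy match. Markup that breaks the stated assumptions can slip through, which is the limitation the comments above point out.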

1 Answer


Use https://www.npmjs.com/package/cherio

Implementation of core jQuery designed specifically for the server.

Select the elements jQuery-style and remove them:

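const cheerio = require('cherio');
// Parse the scraped HTML string into a queryable document.
const $ = cheerio.load(scrapedcontent);
// Remove every element matching the selector, e.g. 'script' for script elements.
$('.abc').remove(); // your selector
// Serialize the remaining document back to an HTML string.
const newHtml = $.html();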
Tushar Gupta - curioustushar
  • My question specifically said "this is a Node.js project that does not include the jQuery library and I do not want to add jQuery just to strip scripts out of strings." Yet the first answer I get involves exactly what I asked anyone answering not to do. I am not looking for a way to import jQuery just to strip script tags. I am looking for a regex that works with regular JavaScript. – PostAlmostAnything Mar 20 '21 at 01:41
  • @PostAlmostAnything - There is no jQuery in this answer. This is the cheerio library which has its own jQuery-like functionality designed for server use on parsed HTML content. – jfriend00 Mar 20 '21 at 04:24