I would like to extract only the JavaScript from script tags in an HTML document so that I can pass it to a JS parser such as esprima. I am writing this application in Node.js and already have the content of each script tag as a string. The problem is that the extracted JavaScript sometimes contains HTML comments, which I want to remove.
<!-- var a; --> should be converted to var a
Simply removing every <!-- and --> does not work: in a case like <!-- if(j-->0); --> it also removes the --> in the middle.
I would also like to remove markers like [if !IE] and [endif], which are sometimes found inside script tags, and to extract the JS inside CDATA segments.
<![CDATA[ var a; ]]> should be converted to var a
Is all this possible with a regex, or is something more required?
In short I would like to sanitize the JS from script tags so that I can safely pass it into a parser like esprima.
Thanks!

EDIT:
Based on @user568109's answer, this is the rough code that handles HTML comments and CDATA segments inside script tags:

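var htmlparser = require("htmlparser2");

// Assumed sample input; replace it with the real HTML document string.
var input = "<script type='text/javascript'><!-- var a; --></script>";

var jstext = '';
var inScript = false; // true while inside a JavaScript <script> element

var parser = new htmlparser.Parser({
    onopentag: function(name, attribs) {
        if (name === "script" && attribs.type === "text/javascript") {
            inScript = true;
            jstext = '';
        }
    },
    ontext: function(text) {
        // Collect text only while inside the script element.
        if (inScript) {
            jstext += text;
        }
    },
    oncomment: function(data) {
        // Keep the body of an HTML comment found inside the script,
        // even when the comment is the first thing in the element.
        if (inScript) {
            jstext += data;
        }
    },
    onclosetag: function(tagname) {
        if (tagname === "script" && inScript) {
            console.log(jstext);
            inScript = false;
            jstext = '';
        }
    }
}, {
    // In xmlMode the script content is parsed as markup, so comments and
    // CDATA sections inside it are reported to the handlers above.
    xmlMode: true
});
parser.write(input);
parser.end();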
everconfusedGuy
  • You can use regexes to do this. Try this simple regex in your browser... `"3) -->".replace(/^$/g, "$1")` and see how you can improvise and get your job done. – mohkhan Jul 19 '13 at 07:35
  • Is there a more systematic way of doing this? There seem to be many cases where the content of a script tag is not valid JS. CDATA and HTML comments are just some of the cases that I have come across so far. – everconfusedGuy Jul 19 '13 at 07:38
  • `so many cases where there is not valid JS inside script tags.` if it's not valid JS, what's the point of pulling them out? – kennypu Jul 19 '13 at 08:44
  • By "not valid JS" I mean that the scripts also contain HTML comments, etc. The browser parses them properly, but on their own they are not valid JS. – everconfusedGuy Jul 19 '13 at 08:48

1 Answer

That is the job of a parser. See htmlparser2, or esprima itself. Please don't use a regex to parse HTML; it is seductive, but you will waste your time and effort trying to match more and more tags.

An example from the htmlparser2 page:

var htmlparser = require("htmlparser2");
var parser = new htmlparser.Parser({
    onopentag: function(name, attribs){
        if(name === "script" && attribs.type === "text/javascript"){
            console.log("JS! Hooray!");
        }
    },
    ontext: function(text){
        console.log("-->", text);
    },
    onclosetag: function(tagname){
        if(tagname === "script"){
            console.log("That's it?!");
        }
    }
});
parser.write("Xyz <script type='text/javascript'>var foo = '<<bar>>';</script>");
parser.end();

Output (simplified):

--> Xyz 
JS! Hooray!
--> var foo = '<<bar>>';
That's it?!

It will give you all the tags: divs, comments, scripts, etc. But you would have to validate the script inside the comments yourself. Also, CDATA is valid markup in XML (XHTML), so htmlparser2 would detect it as a comment; you would have to check those too.
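
As a rough illustration of that checking step, here is a minimal sketch that strips a wrapping HTML comment or CDATA marker from script text that has already been extracted as a string. `stripScriptWrappers` is a hypothetical helper, not part of htmlparser2 or esprima. It assumes the markers appear only at the very start and very end of the text, and it does not handle `[if !IE]`-style markers.

```javascript
// Hypothetical helper (an assumption for illustration, not a library API).
// Removes one leading "<!--" or "<![CDATA[" and one trailing "-->" or "]]>"
// from already-extracted script text.
function stripScriptWrappers(source) {
  return source
    .trim()
    // Opening markers are removed only at the very start of the text.
    .replace(/^(?:<!--|<!\[CDATA\[)/, "")
    // Closing markers are removed only at the very end of the text.
    .replace(/(?:-->|\]\]>)$/, "")
    .trim();
}

console.log(stripScriptWrappers("<!-- var a; -->"));      // var a;
console.log(stripScriptWrappers("<!-- if(j-->0); -->"));  // if(j-->0);
console.log(stripScriptWrappers("<![CDATA[ var a; ]]>")); // var a;
```

Because both patterns are anchored, the `-->` in the middle of `if(j-->0);` is left alone. The result still has to go through a real JS parser to find out whether it is valid.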

user568109