7

I need to extract an entire javascript function from a script file. I know the name of the function, but I don't know what the contents of the function may be. This function may be embedded within any number of closures.

I need to have two output values:

  1. The entire body of the named function that I'm finding in the input script.
  2. The full input script with the found named function removed.

So, assume I'm looking for the findMe function in this input script:

function() {
  function something(x,y) {
    if (x == true) {
      console.log ("Something says X is true");
      // The regex should not find this:
      console.log ("function findMe(z) { var a; }");
    }
  }
  function findMe(z) {
    if (z == true) {
      console.log ("Something says Z is true");
    }
  }
  findMe(true);
  something(false,"hello");
}();

From this, I need the following two result values:

  1. The extracted findMe script

    function findMe(z) {
      if (z == true) {
        console.log ("Something says Z is true");
      }
    }
    
  2. The input script with the findMe function removed

    function() {
      function something(x,y) {
        if (x == true) {
          console.log ("Something says X is true");
          // The regex should not find this:
          console.log ("function findMe(z) { var a; }");
        }
      }
      findMe(true);
      something(false,"hello");
    }();
    

The problems I'm dealing with:

  1. The body of the script to find could have any valid javascript code within it. The code or regex to find this script must be able to ignore values in strings, multiple nested block levels, and so forth.

  2. If the function definition to find is specified inside of a string, it should be ignored.

Any advice on how to accomplish something like this?

Update:

It looks like regex is not the right way to do this. I'm open to pointers to parsers that could help me accomplish this. I'm looking at Jison, but would love to hear about anything else.

Tauren
  • 26,795
  • 42
  • 131
  • 167
  • 1
    You need to have it done with javascript or you can use some other language (eg. python)? – Matteo Ceccarello Jul 05 '11 at 21:23
  • I'm parsing javascript files on the server, but I'm doing it in node.js. So it would be best for it to be javascript doing it. I'm looking at Jison right now as a possible solution: http://zaach.github.com/jison/ – Tauren Jul 05 '11 at 21:34
  • I've just updated the question to not be regex specific. Basically, I'm looking for a solution to the problem, whether the solution invovles regex or not doesn't matter. – Tauren Jul 05 '11 at 21:40
  • perhaps you should try to look for the function name using regular expressions, then use a stack to select the function body: you parse the file from where you found the function name, pushing a "{" (or anything else) in the stack when you find one, and popping a symbol from the stack when you find a "}". When the stack gets empty, you have reached the end of the function body, and you are done. It surely isn't efficient nor extremely elegant, but it may be a solution. – Matteo Ceccarello Jul 05 '11 at 21:56
  • It'll break on any string declaration with an opening or closing bracket that isn't handled. Or comment. Or anything that lets you not close a bracket and remain valid. I don't think there are simple solutions, just gotta buckle down and write a (simple) parser. – NorthGuard Jul 05 '11 at 22:50
  • In the example you post, the script will be broken after you remove the `findMe()` function, because there is still code that calls it. Also, given your requirement to find the function even if embedded within other functions, are you sure there won't be more than one function with the name you are looking for? – nnnnnn Jul 05 '11 at 23:57
  • @nnnnnn Yes, I realize this seems like a strange thing to do. Basically, I have a bunch of very similar javascript files that all define the same functions. Each file has some unique code after defining these common functions. I'd like to pull all of the common functions out from each file. Then I'd write a new file that starts a closure, includes **one** copy of the common functions, and then all of the other code from all the files. The common functions take up far more size than the rest of each file, so this will significantly reduce the overall size to something downloadable. – Tauren Jul 06 '11 at 00:23

5 Answers5

3

A regex can't do this. What you need is a tool that parses JavaScript in a compiler-accurate way, builds up a structure representing the shape of the JavaScript code, enables you to find the function you want and print it out, and enables you to remove the function definition from that structure and regenerate the remaining javascript text.

Our DMS Software Reengineering Toolkit can do this, using its JavaScript front end. DMS provides general parsing, abstract syntax tree building/navigating/manipulation, and prettyprinting of (valid!) source text from a modified AST. The JavaScript front end provides DMS with compiler-accurate definition of JavaScript. You can point DMS/JavaScript at a JavaScript file (or even various kinds of dynamic HTML with embedded script tags containing JavaScript), have it produce the AST. A DMS pattern can be used to find your function:

  pattern find_my_function(r:type,a: arguments, b:body): declaration
     " \r my_function_name(\a) { \b } ";

DMS can search the AST for a matching tree with the specified structure; because this is an AST match and not a string match, line breaks, whitespace, comments and other trivial differences won't fool it. [What you didn't say is what to if you have more than one function in different scopes: which one do you want?]

Having found the match, you can ask DMS to print just that matched code which acts as your extraction step. You can also ask DMS to remove the function using a rewrite rule:

  rule remove_my_function((r:type,a: arguments, b:body): declaration->declaration
     " \r my_function_name(\a) { \b } " -> ";";

and then prettyprint the resulting AST. DMS will preserve all the comments properly.

What this does not do, is check that removing the function doesn't break your code. After all, it may be in a scope where it directly accesses variables defined locally in the scope. Removing it to another scope now means it can't reference its variables.

To detect this problem, you not only need a parser, but you need a symbol table with maps identifiers in the code to definitions and uses. The removal rule then has to add a semantic condition to check for this. DMS provides the machinery to build such a symbol table from the AST using an attribute grammar.

To fix this problem, when removing the function, it may be necessary to modify the function to accept additional arguments replacing the local variables it accesses, and modify the call sites to pass in what amounts to references to the local variables. This can be implemented with a modest sized set of DMS rewrite rules, that check the symbol tables.

So removing such a function can be a lot more complex than just moving the code.

Ira Baxter
  • 93,541
  • 22
  • 172
  • 341
2

I built a solution in C# using plain old string methods (no regex) and it works for me with nested functions as well. The underlying principle is in counting braces and checking for unbalanced closing braces. Caveat: This won't work for cases where braces are part of a comment but you can easily enhance this solution by first stripping out comments from the code before parsing function boundaries.

I first added this extension method to extract all indices of matches in a string (Source: More efficient way to get all indexes of a character in a string)

    /// <summary>
    /// Source: https://stackoverflow.com/questions/12765819/more-efficient-way-to-get-all-indexes-of-a-character-in-a-string
    /// </summary>
    public static List<int> AllIndexesOf(this string str, string value)
    {
        if (String.IsNullOrEmpty(value))
            throw new ArgumentException("the string to find may not be empty", "value");
        List<int> indexes = new List<int>();
        for (int index = 0; ; index += value.Length)
        {
            index = str.IndexOf(value, index);
            if (index == -1)
                return indexes;
            indexes.Add(index);
        }
    }

I defined this struct for easy referencing of function boundaries:

    private struct FuncLimits
    {
        public int StartIndex;
        public int EndIndex;
    }

Here's the main function where I parse the boundaries:

    public void Parse(string file)
    {
        List<FuncLimits> funcLimits = new List<FuncLimits>();

        List<int> allFuncIndices = file.AllIndexesOf("function ");
        List<int> allOpeningBraceIndices = file.AllIndexesOf("{");
        List<int> allClosingBraceIndices = file.AllIndexesOf("}");

        for (int i = 0; i < allFuncIndices.Count; i++)
        {
            int thisIndex = allFuncIndices[i];
            bool functionBoundaryFound = false;

            int testFuncIndex = i;
            int lastIndex = file.Length - 1;

            while (!functionBoundaryFound)
            {
                //find the next function index or last position if this is the last function definition
                int nextIndex = (testFuncIndex < (allFuncIndices.Count - 1)) ? allFuncIndices[testFuncIndex + 1] : lastIndex;

                var q1 = from c in allOpeningBraceIndices where c > thisIndex && c <= nextIndex select c;
                var qTemp = q1.Skip<int>(1); //skip the first element as it is the opening brace for this function

                var q2 = from c in allClosingBraceIndices where c > thisIndex && c <= nextIndex select c;

                int q1Count = qTemp.Count<int>();
                int q2Count = q2.Count<int>();

                if (q1Count == q2Count && nextIndex < lastIndex)
                    functionBoundaryFound = false; //next function is a nested function, move on to the one after this
                else if (q2Count > q1Count)
                {
                    //we found the function boundary... just need to find the closest unbalanced closing brace 
                    FuncLimits funcLim = new FuncLimits();
                    funcLim.StartIndex = q1.ElementAt<int>(0);
                    funcLim.EndIndex = q2.ElementAt<int>(q1Count);
                    funcLimits.Add(funcLim);

                    functionBoundaryFound = true;
                }
                testFuncIndex++;
            }
        }
    }
Community
  • 1
  • 1
byteian
  • 21
  • 3
2

If the script is included in your page (something you weren't clear about) and the function is publicly accessible, then you can just get the source to the function with:

functionXX.toString();

https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/Function/toString

Other ideas:

1) Look at the open source code that does either JS minification or JS pretty indent. In both cases, those pieces of code have to "understand" the JS language in order to do their work in a fault tolerant way. I doubt it's going to be pure regex as the language is just a bit more complicated than that.

2) If you control the source at the server and are wanted to modify a particular function in it, then just insert some new JS that replaces that function at runtime with your own function. That way, you let the JS compiler identify the function for you and you just replace it with your own version.

3) For regex, here's what I've done which is not foolproof, but worked for me for some build tools I use:

I run multiple passes (using regex in python):

  1. Remove all comments delineated with /* and */.
  2. Remove all quoted strings
  3. Now, all that's left is non-string, non-comment javascript so you should be able to regex directly on your function declaration
  4. If you need the function source with strings and comments back in, you'll have to reconstitute that from the original, now that you know the begin end of the function

Here are the regexes I use (expressed in python's multi-line format):

reStr = r"""
    (                               # capture the non-comment portion
        "(?:\\.|[^"\\])*"           # capture double quoted strings
        |
        '(?:\\.|[^'\\])*'           # capture single quoted strings
        |
        (?:[^/\n"']|/[^/*\n"'])+    # any code besides newlines or string literals
        |
        \n                          # newline
    )
    |
    (/\*  (?:[^*]|\*[^/])*   \*/)       # /* comment */
    |
    (?://(.*)$)                     # // single line comment
    $"""    

reMultiStart = r"""         # start of a multiline comment that doesn't terminate on this line
    (
        /\*                 # /* 
        (
            [^\*]           # any character that is not a *
            |               # or
            \*[^/]          # * followed by something that is not a /
        )*                  # any number of these
    )
    $"""

reMultiEnd = r"""           # end of a multiline comment that didn't start on this line
    (
        ^                   # start of the line
        (
            [^\*]           # any character that is not a *
            |               # or
            \*+[^/]         # * followed by something that is not a /
        )*                  # any number of these
        \*/                 # followed by a */
    )
"""

regExSingleKeep = re.compile("// /")                    # lines that have single lines comments that start with "// /" are single line comments we should keep
regExMain = re.compile(reStr, re.VERBOSE)
regExMultiStart = re.compile(reMultiStart, re.VERBOSE)
regExMultiEnd = re.compile(reMultiEnd, re.VERBOSE)

This all sounds messy to me. You might be better off explaining what problem you're really trying to solve so folks can help find a more elegant solution to the real problem.

jfriend00
  • 683,504
  • 96
  • 985
  • 979
  • Thanks, I am aware of this. But in this case, this is being done on the server and I'm just processing plain text. I don't have the javascript included in a page. – Tauren Jul 05 '11 at 21:36
  • OK, I added more options to my answer. – jfriend00 Jul 05 '11 at 21:56
  • @jfriend00, I don't see that regex handle JS regex quoting anywhere. ;-) – Qtax Jul 05 '11 at 22:40
  • Also, your comments will break on `/* foo **/`. – Qtax Jul 05 '11 at 22:42
  • This is a start that I use on my own code (for partial minimizing of my JS that retains max debuggability) where I don't do things in the JS that break it. I say in my answer that it's not foolproof. I'm just sharing the regex stuff that I have. Anyone else is free to either use as is, improve it or share something you have that works better. – jfriend00 Jul 05 '11 at 22:55
  • I should add that if anyone wants to propose improvements to these regular expressions, I'd be happy to include them in my answer post to improve the answer (and use them in my own code). – jfriend00 Jul 05 '11 at 23:04
  • 1
    +1 Good idea to look at existing minifiers and other code parsing tools. I also suggest looking at JSLint. – davin Jul 05 '11 at 23:10
  • @jfriend: +1, thanks for the ideas. I can't believe I didn't think to look at minification libraries! In fact, I'm already using UglifyJS to minify this same javascript that I'm trying to extract a function from. It looks like it creates an intermediary AST, so I should be able to use it to get what I need! – Tauren Jul 05 '11 at 23:32
  • If anyone is still looking into how to do this (it's now 2017), I'd recommend getting familiar with Babel. You could write a [babel plugin](https://github.com/thejameskyle/babel-handbook/blob/master/translations/en/plugin-handbook.md) to parse, transform, and generate the desired output. – Tauren Feb 09 '17 at 01:37
1

I am almost afraid that regex cannot do this job. I think it is the same as trying to parse XML or HTML with regex, a topic that has already caused various religious debates on this forum.

EDIT: Please correct me if this is NOT the same as trying to parse XML.

Hyperboreus
  • 31,997
  • 9
  • 47
  • 87
  • Though this doesn't answer the question, I give it +1 since this would be an abuse of regular expressions. Regular expressions are never meant to parse input that is recursive by nature (a programming language); – Ruan Mendes Jul 05 '11 at 21:28
  • That's what I was afraid of too. I'm looking into Jison to see if it would help. http://zaach.github.com/jison/ – Tauren Jul 05 '11 at 21:35
  • NP-complete, huh? What does that have to do with anything? Javascript syntax is a context free grammar, as are most languages. Parsing is very much a deterministic polynomial time operation. In addition, your reduction to generic parsing is incorrect (it's doesn't constitute a valid iff since it's unidirectional). While being able to parse would solve this problem, this is a subproblem of the parsing problem, and potentially easier, even potentially solvable by a regular expression. – davin Jul 05 '11 at 22:49
  • @davin You stated that JS syntax is a context free grammar (hence ir can be represented/written by/validated by a pushdown automaton). Regular expressions are regular and hence are equivalent to finite state automata. A finite state automaton cannot process a context free grammar. – Hyperboreus Jul 06 '11 at 13:32
  • @davin About parsing non regular input with regular expressions, confer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Hyperboreus Jul 06 '11 at 13:37
  • Hyperboreus, I stated that js is a CFG to explain why the question has nothing to do with the class NPC. Regarding the rest of your comments, I thought I was quite clear: you're assuming that the OP's problem is equivalent to being able to parse JS, but it isn't. As I stated, your assumption is a reduction, although it's incorrect. I agree that CFG parsing in general is not solvable with regular expressions, although (1) that doesn't mean that simple regular expressions with some free language aren't powerful enough and (2) the OP's problem is **not** generic parsing. – davin Jul 06 '11 at 14:23
  • @davin Thank you for the comment. I agree on your ref (1). Especially as today most implementations of regular expressions are NOT regular, but have climbed the ladder of the Chomsky taxonomy to the next rung (back-reference, etc). – Hyperboreus Jul 06 '11 at 21:48
-1

I guess you would have to use and construct a String-Tokenizer for this job.

function tokenizer(str){
  var stack = array(); // stack of opening-tokens
  var last = ""; // last opening-token

  // token pairs: subblocks, strings, regex
  var matches = {
    "}":"{",
    "'":"'",
    '"':'"',
    "/":"/"
  };

  // start with function declaration
  var needle = str.match(/function[ ]+findme\([^\)]*\)[^\{]*\{/);

  // move everything before needle to result
  var result += str.slice(0,str.indexOf(needle));
  // everithing after needle goes to the stream that will be parsed
  var stream = str.slice(str.indexOf(needle)+needle.length);

  // init stack
  stack.push("{");
  last = "{";

  // while still in this function
  while(stack.length > 0){

    // determine next token
    needle = stream.match(/(?:\{|\}|"|'|\/|\\)/); 

    if(needle == "\\"){
      // if this is an escape character => remove escaped character
      stream = stream.slice(stream.indexOf(needle)+2);
      continue;

    }else if(last == matches[needle]){
      // if this ends something pop stack and set last
      stack.pop();
      last = stack[stack.length-1];

    }else if(last == "{"){  
      // if we are not inside a string (last either " or ' or /)
      // push needle to stack
      stack.push(needle);
      last = needle;
    }

    // cut away including token
    stream = stream.slice(stream.indexOf(needle)+1);
  }

  return result + stream;
}

oh, I forgot tokens for comments... but i guess you got an idea now of how it works...

zuloo
  • 1,310
  • 1
  • 8
  • 12