Is there an existing algorithm to find all literal Regular Expression occurrences within a single line of valid JavaScript code?

Given that a literal Regular Expression cannot be multi-line, I need to detect all regular expressions within a single line of code or, more specifically, the start and end indexes of each regular expression, if any are present.

function enumRegex(textLine) {
    // magic happens here;
}

var testLine = 'var regEx1 = /one/; regEx2 = /two/;';

console.log(enumRegex(testLine));

Expected output: Array of index pairs (start and end index for each RegEx found):

[{start: 13, end: 17}, {start: 29, end: 33}]

UPDATE: After playing with the answers to "Is there a regular expression to detect a valid regular expression?", I'm not sure that approach would even work. So, if someone suggests using a regular expression to detect regular expressions, the answer would need an example that actually works. I'd rather see an algorithm.

vitaly-t
  • 1) Use a regex to extract what is between / 2) Feed all extracted strings to RegExp constructor and check if the result is a regex object (it will throw a SyntaxError if what you feed it is not a valid regex). – kliron Dec 26 '15 at 12:42
  • As for [finding string literals](http://stackoverflow.com/questions/34461781/finding-text-strings-in-javascript#34461862), you will need a parser to get reliable results. – GOTO 0 Dec 26 '15 at 12:43
  • @kliron see the update in my question, please. – vitaly-t Dec 26 '15 at 12:51
  • Yes, there is an algorithm, the one found in all JS parsers. –  Dec 26 '15 at 12:52
  • Take a look at [the grammar for regular expression literals](http://www.ecma-international.org/ecma-262/6.0/#sec-literals-regular-expression-literals) and use a parser to parse them. That’s the only good way to find valid regular expression literals within a string. – poke Dec 26 '15 at 12:53
  • @torazaburo why won't you publish that algorithm as an answer then? – vitaly-t Dec 26 '15 at 12:54
  • @kliron It's more complicated than you make it sound, once you consider slashes starting and ending and inside comments, slashes inside quoted strings and template strings, and escaped slashes inside regular expressions. –  Dec 26 '15 at 12:55
  • The algorithm is already published in the source code of the parsers. –  Dec 26 '15 at 12:55
  • This question will eventually turn into “how does a parser work?” and that’s far too broad, as such I’m voting to close this question as off-topic. – poke Dec 26 '15 at 12:55
  • @torazaburo Of course it is. I didn't mean to write a general solution to an unsolvable problem in 1 comment line on SO. – kliron Dec 26 '15 at 12:58
  • @poke this is a very specific question that doesn't need to be turned into anything other than a correct answer. – vitaly-t Dec 26 '15 at 12:59
  • Is the single line of code totally random ? – Serge K. Dec 26 '15 at 13:06
  • @NathanP. Random, but always valid JavaScript code. I'm not considering invalid JavaScript. – vitaly-t Dec 26 '15 at 13:07
  • @vitaly-t And regex are always assigned to a variable, or also should be picked in function calls such as `str.match(/one/)` ? – Serge K. Dec 26 '15 at 13:10
  • @NathanP. any of those, no restriction. – vitaly-t Dec 26 '15 at 13:10
  • @vitaly-t A correct answer for this question would be “Write a parser.” That should give you enough of an idea to work with. Everything else (like, writing an actual parser *for you*) would be far too much effort and require far too much detail on our behalf (remember: we’re not here to write code for you). So if “write a parser” is not clear enough for you, then your question is essentially a “how does a parser work?” or “how do I write a parser?” and that’s too broad. – poke Dec 26 '15 at 13:10
  • @poke, that's why my question opened with: >Is there an existing algorithm to find... – vitaly-t Dec 26 '15 at 13:11
  • Parsers *are* existing algorithms that are also covered in thousands of books in crazy detail. – poke Dec 26 '15 at 13:16
  • @poke, in my specific case I believe it is possible to implement via RegExp, it is only a matter of how. – vitaly-t Dec 26 '15 at 13:21

1 Answer

The only complication that stops you from using /\/.+\/[a-z]*/g as your regex test is... well, it would fail to find itself, for starters. It doesn't cope with backslash-escaped slashes, and the greedy .+ would run on to the last slash on the line.

Not a problem.

/\/(?:\\.|[^\/])+\/[a-z]*/g

So what does this regex do?

  1. Look for / - this is the start of a regex... probably!
  2. Look for one or more of either...
    • A backslash followed by any character (this is our escape-ignoring logic, note that "any character" excludes newlines)
    • OR any character that is not a forward slash
  3. Find the / that indicates the end of the regex
  4. Find any modifiers attached to the regex literal.
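To make the steps above concrete, here is a small sketch that runs the pattern over a sample line. The helper name findCandidates and the sample line are made up for illustration; they are not part of the original answer.

```javascript
// Hypothetical helper: returns every substring that *looks like* a regex literal.
function findCandidates(line) {
    // "/", then one or more (escaped pair | non-slash char), then "/", then flags
    var candidate = /\/(?:\\.|[^\/])+\/[a-z]*/g;
    return line.match(candidate) || [];
}

// The sample line contains: var a = /one/; var b = /a\/b/gi;
var sample = 'var a = /one/; var b = /a\\/b/gi;';
console.log(findCandidates(sample));
// finds "/one/" and "/a\/b/gi" (the escaped slash does not end the match)
```

Note that the second match keeps going past the escaped slash, which is exactly what the "backslash followed by any character" branch is for.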

And... done! If this regex matches something, then it has found something that looks like a regex. However... that's not to say you've found a valid regex. For that, we need some validation.

First, let's capture the regex and modifiers:

/\/((?:\\.|[^\/])+)\/([a-z]*)/g

Now, for each match, we attempt to create a regex object from it:

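var isValid = true;
try {
    // pass the suspect pattern as the first argument, the modifiers as the second
    new RegExp(match[1], match[2]);
}
catch(e) {
    // the constructor throws a SyntaxError for an invalid pattern or invalid flags
    isValid = false;
}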

So your final code might look something like...

function enumRegex(textLine) {
    var parser = new RegExp("/((?:\\\\.|[^/])+)/([a-z]*)","g");
    // note that rules for escaping are very different in new RegExp than with literals

    var match, results = [];
    while( match = parser.exec(textLine)) {
        try {
            new RegExp(match[1],match[2]);
        }
        catch(e) {
            continue;
        }
        results.push(match[0]);
    }

    return results;
}
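The question asks for start and end indexes, while the function above collects the matched text. A possible variant that reports positions instead is sketched below. The name enumRegexRanges and the start/end property names are assumptions for illustration; end is taken to be the index of the closing slash, which is what the question's example suggests.

```javascript
// Hypothetical variant of enumRegex that reports positions instead of text.
function enumRegexRanges(textLine) {
    var parser = /\/((?:\\.|[^\/])+)\/([a-z]*)/g;
    var match, results = [];
    while ((match = parser.exec(textLine)) !== null) {
        try {
            new RegExp(match[1], match[2]);
        }
        catch (e) {
            continue; // not a valid pattern/flags combination, skip it
        }
        results.push({
            start: match.index,                     // index of the opening "/"
            end: match.index + match[1].length + 1  // index of the closing "/", before any flags
        });
    }
    return results;
}

console.log(enumRegexRanges('var regEx1 = /one/; regEx2 = /two/;'));
// → [{start: 13, end: 17}, {start: 29, end: 33}]
```

This inherits every limitation of the pattern-based approach; only the shape of the result changes.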

It is worth noting that this is far from flawless. Problems include:

  • var falseMatch = 'var string = "/trololol/";';
  • var falseMatch = '// comment line with a /regex-like substring/derp';
  • var falseMatch = 'var number = 8 / 2 / 2;'; (the / 2 / is seen as a regex)

These and more would require more content-aware parsing than a simple regex will allow.
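As a rough illustration of what "more content-aware" could mean, the sketch below walks the line character by character, skips quoted strings and comments, and treats a slash as the start of a regex only when the previous significant character is an operator or opening punctuation. This is a heuristic, not a parser, and not something from the original answer: the function name scanRegexLiterals and the operator list are assumptions, and it will still miss cases such as a regex after a keyword (return /x/) or after a closing parenthesis, and it ignores template literals.

```javascript
// Heuristic sketch (not a parser): find regex literals in one line of code.
function scanRegexLiterals(line) {
    var OPERAND_EXPECTED = '(,=:[!&|?{};+-*%<>~^/'; // assumed list of "a value comes next" characters
    var results = [];
    var prev = '';   // last significant (non-space) character seen
    var i = 0;
    while (i < line.length) {
        var ch = line[i];
        if (ch === '"' || ch === "'") {
            // skip a quoted string, honoring backslash escapes
            var j = i + 1;
            while (j < line.length && line[j] !== ch) {
                j += (line[j] === '\\') ? 2 : 1;
            }
            i = j + 1;
            prev = ch;   // a string is a value, so a following "/" is division
            continue;
        }
        if (ch === '/' && line[i + 1] === '/') {
            break;       // line comment: ignore the rest of the line
        }
        if (ch === '/' && line[i + 1] === '*') {
            var close = line.indexOf('*/', i + 2);
            if (close === -1) break;   // comment continues past this line
            i = close + 2;
            continue;
        }
        if (ch === '/' && (prev === '' || OPERAND_EXPECTED.indexOf(prev) !== -1)) {
            // a value is expected here, so try to read a regex body
            var k = i + 1, inClass = false;
            while (k < line.length && (inClass || line[k] !== '/')) {
                if (line[k] === '\\') { k += 2; continue; }   // escaped character
                if (line[k] === '[') inClass = true;          // "/" inside [...] does not close
                else if (line[k] === ']') inClass = false;
                k++;
            }
            if (k < line.length) {
                results.push({ start: i, end: k });
                i = k + 1;
                while (i < line.length && /[a-z]/.test(line[i])) i++;   // skip flags
                prev = ')';   // a regex literal is a value, like a closed expression
                continue;
            }
        }
        if (!/\s/.test(ch)) prev = ch;
        i++;
    }
    return results;
}

console.log(scanRegexLiterals('func(1/2, /text/)'));        // one range: 10 to 15
console.log(scanRegexLiterals('var number = 8 / 2 / 2;'));  // no ranges
```

The design choice here is to decide from the left context alone whether a slash can begin a value, which is roughly how tokenizers tell division from regex literals; anything more reliable needs a real JavaScript parser.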

Niet the Dark Absol
  • While I am evaluating your answer, I just wanted to highlight once again that I'm looking at valid JavaScript only, I'm not interested in invalid RegExp or invalid JavaScript at all. But thanks for considering those cases also! :) – vitaly-t Dec 26 '15 at 13:35
  • @vitaly-t That is something that simply cannot be done without the use of recursive regular expressions, something that JavaScript's engine does not support. The next best option is to find suspected regexes, and put them to the test with the built-in parser. – Niet the Dark Absol Dec 26 '15 at 13:38
  • In my code I am already separating text blocks, comment blocks and regular expressions from each other, so your solution is even closer to what I need than you think :) – vitaly-t Dec 26 '15 at 13:46
  • For general purpose it is a good answer that I have accepted. Thank you so much! – vitaly-t Dec 26 '15 at 13:47
  • I'm confused by why you would present an entire solution and then point out all the reasons it would not work. The correct answer would just be your last sentence. –  Dec 26 '15 at 17:34
  • @torazaburo It solves the problem with some caveats. If those caveats are addressed by other code (such as OP's assertion that they already separate text and comments out first), then it works again. – Niet the Dark Absol Dec 26 '15 at 18:06
  • And this is precisely what my code does, it separates comments and text, so the logic suggested here works fine afterwards. – vitaly-t Dec 29 '15 at 17:18
  • @NiettheDarkAbsol Here is an example where the code doesn't work: `func(1/2, /text/)`. I was expecting it to work in cases like this, but it doesn't. – vitaly-t Dec 30 '15 at 06:33