Note: the goal here is not lexical analysis, so please do not suggest lexing or parsing code. My apologies for adding to the pile of "regex comments" questions, but the best (most-voted) answer is inadequate given how the result would be used in that question (though I was able to start from there), and many of the other answers I've reviewed are simply irrelevant to what I'm trying to do.

I've built a regex which works in principle as expected here.


/(?:\n|^)(?:[^'"])*?(?:'(?:[^\\\r\n]|[\\]{2}|\\')*'|"(?:[^\\\r\n]|[\\]{2}|\\")*")*?(?:[^'"])*?(\/\*(?:[\s\S]*?)\*\/)/g

The final group matches block comments well, as referenced in the SO answer above:

(\/\*(?:[\s\S]*?)\*\/)

Everything preceding the actual match is discarded, but used for the purpose of matching a valid block comment - i.e. not something found in a string literal.

Ignore the case where a regex can look like a block comment.

Assume that the input string is linted, not free-form javascript.


But in practice, I'm getting a duplicate on the first match and no other matches.

Why? And how might it be corrected to work in practice?

Thanks in advance for your help and any trouble the question may put you through. :)

Also, any potential pitfalls are welcome (in the comments section), given the information below.

Extra information, not relevant to the direct question: the ultimate goal, as hinted in the example code, is to replace/collapse any nested or other code structures so as to focus on the variable declarations at the top of the lexical scope for a given patch of code, for the purpose of hoisting variable declarations to generate a template for a specific use case. I know that sounds like a lot, but I believe it is possible and relatively straightforward (NOT ENTIRELY WITH SIMPLE REPLACEMENT, but nonetheless). To clarify what I mean by "possible": I would prefer to collapse only regexes, block comments, and inline comments (EDIT: and string literals), then recursively collapse only variable scopes (or plain objects) in {blocks} (those that contain no nested blocks) until they are gone, then see what's left. If it seems like this won't work for any reason, please respond only in comments. Thank you!

Community
Nolo
  • You would have to look at the _top_ level parser .. code. If it does C/C++ comments style first, does it exclude quotes or not. Is it possible html can get in the way? –  Jul 07 '15 at 01:16
  • @sln, String literals yes good point, I'll edit that in. And html, there will not be any. – Nolo Jul 07 '15 at 01:18
  • I can give you a bullet-proof regex that does _all_ C/C++ comment processing. Is that what you need? –  Jul 07 '15 at 01:18
  • As long as this is JavaScript only, i.e. no html, it will work. Is that the case? –  Jul 07 '15 at 01:24
  • @sln Yes, given that regex literals should be out of the way first, but that's another issue - basically the part of this regex that isn't quite working would also serve that purpose. But in any case, no html is present in the code - also given that string literals will be gone. :) – Nolo Jul 07 '15 at 01:28
  • @torasaburo Please read. :) The ultimate goal is to build a simple tool for generating a template with variable declarations (only those at the top level of the lexical scope) to be hoisted to the top of a template - it's a specific case. And I want to do it without actually lexing or parsing. – Nolo Jul 07 '15 at 01:31
  • It doesn't matter if string literals are gone. It's easier to do it _with_ literals, but works with or without. In the rx I posted, even if a malformed literal were left, it would go past it. The comment is the star, not the literals. –  Jul 07 '15 at 01:41
  • I fail to understand why your "ultimate goal" precludes parsing. With parsing, your project becomes almost trivial. Whatever regexp you write will in the end never work completely correctly or handle all the cases you want--it can't because of the nature of computer languages and the ability of regexp to handle different degrees of complexity. In fact, with the approach you're taking you're already stuck--which is why you're posting to SO. Other editors and IDEs analyze code for display or summarization purposes exactly by parsing it--why do you imagine that is? –  Jul 07 '15 at 01:51
  • @torazaburo I just wanted a few simple lines of code, this way I can do that. If I include a lexer and a parser, well that's code that I just don't want to look at right now, nor for this purpose. I.e. I didn't want it to be a "project". – Nolo Jul 07 '15 at 01:56
  • @Nolo I guess you've made up your mind, but just a couple of points. You don't have to "look at" anything. All you have to do is include one line in your program which completely parses your Javascript, then do a straightforward traversal of the parse tree to remove block comments. The code involved won't be much longer than the regexp you end up with, and it will be readable, maintainable, extensible, and functional. Take a look at esprima if you haven't already. –  Jul 07 '15 at 02:01

1 Answer

This is one of those "ugh, yeah, of course!" moments.

You might expect the exec() function to return an array with one element, the matched text. It doesn't quite: the first element is the full match, and that is all you get only when there are no capture groups. If there are capture groups, then in addition to result[0] being the full pattern match, result[1] will be the first capture group, result[2] the second, and so on.

For example:

  1. (/l/g).exec("l") gives us ["l"]
  2. (/(l)/g).exec("l") gives us ["l", "l"]
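
The same rule explains why text matched by a non-capturing group still shows up in the result. A minimal sketch, using a made-up toy pattern (not the regex from the question): the non-capturing prefix `(?:a+)` gets no slot of its own, but whatever it matched is still part of result[0].

```javascript
// Hypothetical toy pattern: a non-capturing prefix followed by one capture group.
const result = /(?:a+)(b)/.exec("xaab");

// result[0] is the full match, including the text consumed by (?:a+).
const fullMatch = result[0];

// result[1] holds only what the capture group (b) matched.
const firstGroup = result[1];

console.log(fullMatch, firstGroup, result.length);
```

So `(?:...)` only means "don't give this group its own index"; it does not remove the matched text from the full match.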

Your RE isn't so much the problem (although running the string through a stream filter that takes out block comments is probably easier to work with); the problem is the assumption that you can just call .join() on the exec results. If you have capture groups, and you have a result, join results.slice(1), or call results.splice(0,1) before joining to get rid of the leading element, so you don't accidentally include the full match. Note also that each exec() call returns only one match, so to collect every match with a /g regex you need to call it repeatedly until it returns null.
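
The difference between joining an exec() result directly and using only the capture group can be sketched as follows. This is a simplified illustration, not the asker's code: it uses only the block-comment group from the question's regex, and the names blockComment, source, and comments are made up for the example.

```javascript
// Assumption: only the final block-comment group of the question's regex is used here.
const blockComment = /(\/\*[\s\S]*?\*\/)/g;
const source = "var a = 1; /* one */ var b = 2; /* two */";

// One exec() call yields [fullMatch, group1]; joining it repeats the comment.
const single = blockComment.exec(source);
const joinedAll = single.join("|");
const joinedGroups = single.slice(1).join("|");

// To collect every comment, reset lastIndex and call exec() until it returns null.
blockComment.lastIndex = 0;
const comments = [];
let match;
while ((match = blockComment.exec(source)) !== null) {
  comments.push(match[1]); // keep only the capture group
}

console.log(joinedAll);
console.log(joinedGroups);
console.log(comments);
```

With the full regex from the question, match[0] would also contain everything matched before the comment, which is another reason to read match[1] rather than the whole array.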

Mike 'Pomax' Kamermans
  • Hmm, ok, I tried string match, and that works better, but still not discarding (?:), why would that be? https://jsfiddle.net/375t3cLL/4/ – Nolo Jul 07 '15 at 01:37
  • Try http://jsbin.com/xelamotabo/edit?html,js,output, although this suggests your RE is not doing the continued match quite right (it's getting the `var b = function(){` part, for instance, and the "invalid" part on the last section) – Mike 'Pomax' Kamermans Jul 07 '15 at 01:43
  • Yeah, I'm onto it. Thanks Mike, you get the gold star for today. :) I'll remember .exec() in the future. – Nolo Jul 07 '15 at 01:46