2

I have a small NodeJS program that I use to extract code comments from files I point it to. It mostly works, but I'm having some issues dealing with it misinterpreting certain JS strings (glob patterns) as code comments.

I'm using the regex [^:](\/\/.+)|(\/\*[\W\w\n\r]+?\*\/) to parse the following test file:

function DoStuff() {
    /* This contains the value of foo.
       Foo is used to display "foo"
       via http://stackoverflow.com
    */
    this.foo = "http://google.com";
    this.protocolAgnosticUrl = "//cdnjs.cloudflare.com/ajax/libs/jquery/3.2.1/core.js";

    //Show a message about foo
    alert(this.foo);

    /// This is a triple-slash comment!

    const globPatterns = [
      'path/to/**/*.tests.js',
      '!my-file.js',
      "!**/folder/*",
      'another/path/**/*.tests.js'
  ];
}

Here's a live demo to help visualize what is and is not properly captured by the regex: https://regex101.com/r/EwYpQl/1

I need to be able to only locate the actual code comments here, and not the comment-like syntax that can sometimes appear within strings.

Chris Barr
  • 29,851
  • 23
  • 95
  • 135
  • 6
    Don't use regex to parse code. This applies to JS code in just the same way it applies to HTML or other grammars that are too complex for regex. Use a parser. There is a JS parser, written in JS, that can easily pick out the comments from the source code. http://esprima.org/ – Tomalak May 11 '17 at 17:39
  • @Tomalak esprima looks really cool, thanks for mentioning it. – Karl Reid May 11 '17 at 17:41
  • 1
    Then there is the classic http://stackoverflow.com/a/1732454/3136474 . The same advice applies here. – Dinei May 11 '17 at 17:42
  • Another JS parser that is well-tested and readily available: https://github.com/babel/babylon – Tomalak May 11 '17 at 17:42
  • And another one: http://lisperator.net/uglifyjs/ast – Tomalak May 11 '17 at 17:43
  • 1
    I am aware that this is not the ideal way to parse code, but this is working for 99% of our use cases so far and I've written regular expression to parse code comments in all the other languages we use. This is the old hold-out now. I appreciate the suggestions, but they are not helpful in answering the question I have right now. – Chris Barr May 11 '17 at 17:52
  • 2
    That's the common "but my regex almost works!!1" defense. It doesn't work. And you can't conceive all the edge cases that you can be confronted with in real-world code. If programming languages could be parsed with regex, language parsers would not the complex beasts they are. Despite the fact that you don't want to hear it, "Use the right tool for the job." is not actually an unhelpful response. – Tomalak May 11 '17 at 17:55
  • For what it's worth, I actually did end up using Esprima in the end. It Turned out to be much simpler than I expected. I still use regex parsing for all other languages, and Esprima for Javascript & Typescript. – Chris Barr May 17 '17 at 13:13

1 Answers1

2

I have to agree with the comments that for most cases it is better to use a parser, even when a RegExp can do the job for a specific and well defined use case.

The problem is not that you can't make it work for that very specific use case even thought there are probably plenty of edge cases that you don't really care about, nor have to, but that may break that solution. The actual problem is that if you start building around your sub-optimal solution and your requirements evolve overtime, you will start to patch those as they appear. Someday, you may find yourself with an extensive codebase full of patches that doesn't scale anymore and the only solution will probably be to start from scratch.

Anyway, you have been warned by a few of us, and is still possible that your use case is really that simple and will not change in the future. I would still consider moving from RegExp to a parser at some point, but maybe you can use this meanwhile:

(^ +\/\/(.*))|(["'`]+.*["'`]+.*\/\/(.*))|(["'`]+.*["'`]+.*\/\*([\W\w\n\r]+?)\*\/)|(^ +\/\*([\W\w\n\r]+?)\*\/)

Just in case, I have added a few other cases, such as comments that come straight after some valid code: enter image description here

Edit to prove the first point and what is being said in the comments:

I have just answered this with the previous RegExp that was solving just the issue that you pointed out in your question (your RegExp was misinterpreting strings containing glob patterns as code comments).

So, I fixed that and I even made it able to match comments that start in the same line as a valid (non-commented) statement. Just a moment after posting that I notice that this last feature will only work if that statement contains a string.

This is the updated version, but please, keep in mind that this is exactly what we are warning you about...:

(^[^"'`\n]+\/\/(.*))|(["'`]+.*["'`]+.*\/\/(.*))|(["'`]+.*["'`]+.*\/\*([\W\w\n\r]+?)\*\/)|(^[^"'`\n]+\/\*([\W\w\n\r]+?)\*\/)

enter image description here

How does it work?

There are 4 main groups that compose the whole RegExp, the first two for single-line comments and the next two for multi-line comments:

  • (^[^"'`\n]+//(.*))
  • (["']+.*["']+.//(.))
  • (["']+.*["']+.*/*([\W\w\n\r]+?)*/)
  • (^[^"'`\n]+/*([\W\w\n\r]+?)*/)

You will see there are some repeated patterns:

  • ^[^"'`\n]+: From the start of a line, match anything that doesn't include any kind of quote or line break.

    ` is for ES2015 template literals.

    Line breaks are excluded as well to prevent matching empty lines.

    Note the + will prevent matching comments that are not padded with at least one space. You can try replacing it with *, but then it will match strings containing glob patterns again.

  • ["']+.*["']+.*: This is matching anything that is between quotes, including anything that looks like a comment but it's part of a string. Whatever you match after, it will be outside that string, so using another group you can match comments.

Danziger
  • 19,628
  • 4
  • 53
  • 83
  • 1
    This is great, thank you! In your second example note that the highlighting on the regex101 website is a bit misleading because the capture groups actually are correct in most of those cases. For my purposes I would use the capture groups instead of the entire matching line – Chris Barr May 11 '17 at 18:43