I'm looking to scrape the comments out of a JS file. I was thinking I could write a function that takes a .js file, performs a RegExp match, and outputs an array of strings, using fs.readFile() and string.match().

Here's an over-simplified example:

I have two files: class.js (to read) and parse.js (to perform the text parsing).

class.js:

/*
    by: Mike Freudiger
*/

/**
* one
* @returns 'Hello World'
*/
function one () {
        return 'Hello World';
}

alert();

/* end of file */

parse.js:

var fs = require('fs');

var file = fs.readFile('C:\\Users\\mikef\\Desktop\\node_regex_test\\class.js', 'utf8', function(err, doc) {
    var comments = doc.match(/(\/\*\*(.|\n)+?\*\/)/g);
    console.log(comments);
});

When I run node parse.js, the console output is null.

However, when I run the same regex match on a multiline string literal, I get the expected output:

var doc = `/*
        by: Mike Freudiger
    */

    /**
    * one
    * @returns 'Hello World'
    */
    function one () {
            return 'Hello World';
    }

    alert();

    /* end of file */`

Any idea why the readFile() string would behave differently than a string literal?

...Also, I realize there may be a better way to get these comments out, with another npm package or something, but now I really just want to know why these two strings are different.

mikefreudiger

1 Answer

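As mentioned by vsemozhetbyt, it seems that the newlines used in the class.js file are either \r\n or \r. In a JavaScript regex, . does not match line terminators, so (.|\n) can match \n but never \r, and the match fails at the first \r.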
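
If you want to confirm which line endings a string actually contains, something like the following sketch could help. countLineEndings is a hypothetical helper written for this answer (it is not part of Node or any library), and the sample string is made up; in parse.js you would pass it the doc string from the readFile() callback instead.

```javascript
// Hypothetical helper: counts each kind of line ending in a string.
function countLineEndings(text) {
    var crlf = (text.match(/\r\n/g) || []).length;
    var allCr = (text.match(/\r/g) || []).length;
    var allLf = (text.match(/\n/g) || []).length;
    // Every \r\n pair contributes one \r and one \n, so subtract the pairs
    // to get the number of lone \r and lone \n characters.
    return { crlf: crlf, cr: allCr - crlf, lf: allLf - crlf };
}

// Made-up sample with Windows-style line endings.
var sample = '/**\r\n * one\r\n */';
console.log(countLineEndings(sample));
// JSON.stringify makes the otherwise invisible \r and \n visible as escapes.
console.log(JSON.stringify(sample));
```

A non-zero crlf or cr count would explain why (.|\n) stops matching inside the comment.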

One of the simplest (and fastest) ways to match these newlines is to use [\s\S] instead of (.|\n) in your regex.

Thus you get:

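var fs = require('fs');

// readFile() is asynchronous and returns undefined, so its result is not assigned.
fs.readFile('C:\\Users\\mikef\\Desktop\\node_regex_test\\class.js', 'utf8', function(err, doc) {
    if (err) throw err;
    // [\s\S] matches any character, including \r and \n.
    var comments = doc.match(/(\/\*\*[\s\S]+?\*\/)/g);
    console.log(comments);
});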
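
To see the difference without touching the file system, here is a small sketch that runs both patterns against the same text with \n and with \r\n line endings. The strings are a shortened, made-up version of class.js, and the variable names are only illustrative.

```javascript
// Shortened stand-in for class.js, first with Unix (\n) line endings...
var unixDoc = "/*\n    by: Mike Freudiger\n*/\n\n/**\n* one\n* @returns 'Hello World'\n*/\nfunction one () {}\n";
// ...then the same text with Windows (\r\n) line endings.
var windowsDoc = unixDoc.replace(/\n/g, '\r\n');

var original = /(\/\*\*(.|\n)+?\*\/)/g; // pattern from the question
var fixed = /(\/\*\*[\s\S]+?\*\/)/g;    // pattern using [\s\S]

console.log(unixDoc.match(original));    // array with the one /** ... */ comment
console.log(windowsDoc.match(original)); // null, because neither . nor \n matches \r
console.log(windowsDoc.match(fixed));    // array with the one /** ... */ comment
```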
mr.mams
    That works! Does that mean newlines are represented differently in a file than in a multiline string? Could anyone point to documentation? – mikefreudiger Feb 12 '19 at 00:09
    Template literals always use `\n`: even if you copy text from a `\r\n` file and paste it into a template literal in the source code, after parsing it will contain `\n` separators. – vsemozhebuty Feb 12 '19 at 00:19
    You can check it here: http://exploringjs.com/es6/ch_template-literals.html#_line-terminators-in-template-literals-are-always-lf-n . You can also check the more formal description in the ECMAScript specification here: http://www.ecma-international.org/ecma-262/6.0/#sec-static-semantics-tv-and-trv – mr.mams Feb 12 '19 at 00:22
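
The normalization described in these comments can be observed with a short sketch. It uses new Function only as a way to evaluate source text that contains a real CR LF pair inside a template literal; the variable names are illustrative.

```javascript
// In this ordinary string literal, '\r\n' is an escape sequence, so the string
// holds real CR and LF characters between the backticks.
var source = 'return `a\r\nb`;';

// Evaluate that text as a function body: the template literal is now parsed
// from source code that contains a raw \r\n line terminator.
var value = new Function(source)();

console.log(JSON.stringify(value)); // "a\nb"
console.log(value.length);          // 3
```

The \r is gone from the resulting string, which is why the pasted template literal matched (.|\n) while the text read from disk did not.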