
I have a NW.js app that simply (recursively) scans a directory tree and gets the stats for each file/directory. It also computes an MD5 hash for each file.

I have 29k files and 850 folders, totaling 120 GB of data.

After almost 7 minutes, my code had scanned only 4,080 of the 29k files.

How is it possible that it is so slow? Is there something I can do to improve performance? Otherwise, Node would be useless to me...

What is surprising is that it took "only" 7 seconds to scan 1k files. Why does it take 60 times longer to scan only 4 times as many files?

When I check the processes, I can see that Node's RAM usage fluctuates wildly, from 20 MB to 400 MB and back, while its CPU usage stays stuck at 1%.

It is weird, because I don't think I am allocating that much RAM. Actually, I don't allocate anything! Please see my code below.

if (process.argv.length < 3)
    process.exit();


var fs = require('fs');
var md5 = require('md5');
var md5File = require('md5-file');

var iTotal = 0;
var iNbFiles = 0;
var iNbFolders = 0;

var iBegin = Date.now();

var App =
{
    scan: function(path)
    {
        var items = fs.readdirSync(path);
        var i, item, stats, fullPath, isFolder, fileMD5;
        var len = items.length;
        var md5Hash = md5(path);

        for (i = 0; i < len; i++)
        {
            item = items[i];
            fullPath = path + '/' + item;
            // Note: statSync follows symlinks, so isSymbolicLink() would
            // always be false; lstatSync stats the link itself
            stats = fs.lstatSync(fullPath);
            if (stats.isSymbolicLink())
                continue;

            isFolder = stats.isDirectory();
            if (!isFolder)
            {
                fileMD5 = md5File(fullPath);
                iNbFiles++;
            }
            else
            {
                fileMD5 = null;
                iNbFolders++;
            }

            iTotal++;
            process.send({_type: 'item', name: item, path: path, path_md5: md5Hash, full_path: fullPath, file_md5: fileMD5, stats: stats, is_folder: isFolder});
            if (isFolder)
                App.scan(fullPath);
        }

        process.send({_type: 'temp', total: iTotal, files: iNbFiles, folders: iNbFolders, elapsed: (Date.now() - iBegin)});
    }
};

App.scan(process.argv[2]);

// Send the final and definitive value of "total"
process.send({_type: 'total', total: iTotal, files: iNbFiles, folders: iNbFolders});

process.exit();

1 Answer


Use a ready-made directory walker such as https://github.com/jprichardson/node-klaw

Actually, I don't allocate anything!

Nope, you are allocating: every object or variable creation allocates memory. Also, `md5-file` reads each file through a stream and computes its checksum, so the entire content of every one of your files has to pass through the CPU and memory. You are using the sync version of MD5, which processes only one file at a time. You also use recursion over a large number of files, which means that when the stack runs out you will get an error; I suspect you are hitting that error and just not seeing it, because the script runs without any progress feedback. Use async directory reads and async MD5 calculation. The best solution is a pool of worker processes (for example, 6 workers for a 6-core CPU) that pull paths from a queue and compute the MD5 hashes.

Update 1

Recursion memory leak example:

var i = 0;
function inc() {
    i++;
    var s = "";
    for (var n = 0; n < 4000; n++) { s += "0123456789"; }
    inc(); // never returns: recurses until the call stack is exhausted
}
inc();

Open the task manager and run this code in a browser, and you will see how fast memory consumption grows.
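By contrast, the same string-building work done in a plain loop keeps the stack flat, since each call returns before the next begins (a sketch; the iteration count is arbitrary):

```javascript
var i = 0;
function incOnce() {
    i++;
    var s = "";
    for (var n = 0; n < 4000; n++) { s += "0123456789"; }
    return s.length; // locals are released when the function returns
}
// No growing call stack: one frame at a time, 1000 times.
while (i < 1000) { incOnce(); }
```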

  • Thanks for your answer @VoidVolker. However, when I say "I don't allocate anything", I meant that I don't think I have a memory leak, because I just have variables that I reset on each loop iteration, nothing fancy here. Also, even though I admit that having 6 workers instead of 1 would be faster, it doesn't explain why it takes 7 seconds for 1k files and 7 minutes for 4k files... which is my real issue here, because otherwise it would "only" take 3:30 to scan the whole disk, which is acceptable (for now) – Lideln Kyoku Mar 04 '16 at 16:03
  • But you have recursion. And a function's resources are only freed when it exits (if I remember correctly). So on each new recursive call new resources are allocated, and the old ones are not freed until the last call completes. Take a look at this answer: http://stackoverflow.com/questions/7826992/browser-javascript-stack-size-limit – VoidVolker Mar 04 '16 at 16:48
  • I would be really surprised if it came from this. I modified my script to use the (otherwise excellent) fs-extra lib you advised, and I only gained 4% performance (and I don't think I did enough tests for that difference to be meaningful anyway). Maybe I'll focus on dispatching this calculation to several processes... I tried Go, and although it is 56% faster for 1k files, it is surprisingly slower for 3k+ files – Lideln Kyoku Mar 05 '16 at 00:22
  • Check the code above - it shows the memory leak in recursion. Try this instead: don't use recursion. Build a flat array of files, then use a `for(...){ ... }` loop to calculate the MD5s. – VoidVolker Mar 05 '16 at 06:09
  • Actually, I think I'll use the proper tool for each need: I will build the app with nw.js, scan the directories with Go (which is 4x faster than Node for that purpose), and compute the MD5 hashes using the OpenSSL binary. Thanks anyway @VoidVolker for your help! – Lideln Kyoku Mar 05 '16 at 15:23
  • Yeah, nwjs is awesome. Feel free to ask your questions in the official chat: nwjs/nwjs at gitter.im – VoidVolker Mar 05 '16 at 15:57
  • For anyone finding this question/answer after me: the `fs-extra` referenced above no longer includes "walk", so that deep link won't take you to anything other than the top of the page. "walk" has been moved to its own package, [node-klaw](https://github.com/jprichardson/node-klaw) – Laura Jan 05 '23 at 22:57