First I want to describe the problem: I have a directory tree (depth = 3) which contains several directories and files. Some of those files have a .txt extension and some .mp4. I want to copy only the .mp4 files into a new directory with the same hierarchy as in the source directory (in other words, I don't want to copy all mp4 files into one folder; I want to recreate the directory structure and then copy the mp4 files into it). The question is: how do I copy those files not in series, but in parallel? Here is my code:

const fs = require('fs');
const path = require('path');
const { promisify } = require('util');

const sourceDir = process.argv[2];
const stat = promisify(fs.stat);
const copy = promisify(fs.copyFile);
const mkdir = promisify(fs.mkdir);
const readdir = promisify(fs.readdir);
const targetDir = path.join(__dirname, 'only-mp4');

// creating root folder, all files will be copied here
(async () => {
  await mkdir(targetDir);
})();

const copyMediaFiles = async (node) => {
  try {
    const stats = await stat(node);
    if (stats.isDirectory()) {
      let children = await readdir(node);
      // constructing new paths
      children = children.map((child) => path.join(node, child));
      // "copying" file hierarchy (basically just recreating same file hierarchy in target directory)
      children.forEach((child) => {
        const courseDirs = child.split('/').slice(4, 7).join('/');
        mkdir(path.join(targetDir, courseDirs), { recursive: true });
      });
      // running this function for all children recursively in parallel
      const promises = children.map(copyMediaFiles);
      await Promise.all(promises);
    }
    const ext = path.extname(node);
    const filename = path.basename(node);
    // if file extension == mp4 then copy that file in target directory
    if (ext === '.mp4') {
      await copy(
        node,
        path.join(
          targetDir,
          path.dirname(node).split('/').slice(4).join('/'),
          filename
        )
      );
      console.log('File copied: ', filename);
    }
    return;
  } catch (error) {
    console.log(error);
  }
};

copyMediaFiles(sourceDir).then(() => console.log('All mp4 files copied'));

Yes, it's working, but I'm not sure I did it right. Any advice on that? What did I do wrong here? And I'm not sure I'm traversing this tree correctly.

  • Your program already *does* run concurrently - can you share the log outputs that make you assume it doesn't? The only problems I can see are that the `if` is missing an `else` (it only works because none of your directories have names that end in `.mp4`), and that [you should not use `forEach`](https://stackoverflow.com/q/37576685/1048572) for the `mkdir` calls – Bergi May 04 '20 at 15:48
  • Yeah, I just realized that I spent several hours writing this function and I'm a bit tired; maybe that is the reason I think it doesn't work correctly or is badly written. I just want to hear feedback. Thanks for pointing out the forEach problem – wealmostdone May 04 '20 at 16:18
  • But at the end of the day I'm not sure that doing this task in an async way was a good decision, because I still have to wait for each file to copy, so it's running in series. And if it's waiting, it's the same as sync (in terms of time) – wealmostdone May 04 '20 at 16:22
  • We can't tell whether it's working or not if you don't tell us what the problem is. Please put a `console.log('Handling', node)` in the first line of the function, then post the log output. – Bergi May 04 '20 at 21:20

1 Answer

Two issues:

  • The first call of copyMediaFiles will happen before the first mkdir promise resolves. This is risky, as you may actually try to access the target directory before its creation has completed. If you just put the call to copyMediaFiles inside the async IIFE, you don't have that risk:

    (async () => {
        await mkdir(targetDir);
        await copyMediaFiles(sourceDir);
        console.log('All mp4 files copied');
    })();
    
  • The mkdir calls inside the forEach loop are made without capturing the promises they return, so a similar risk occurs there as well: a copy may start before its target directory exists. A sketch of a fix follows.
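
    A minimal sketch of that fix, reusing the question's hardcoded path slicing (the slice(4, 7) indices are an assumption about the directory layout), to go inside copyMediaFiles after the children paths are built:

    const dirPromises = children.map((child) => {
        const courseDirs = child.split('/').slice(4, 7).join('/');
        // return the promise so it can be awaited below
        return mkdir(path.join(targetDir, courseDirs), { recursive: true });
    });
    // ensure all target directories exist before recursing and copying
    await Promise.all(dirPromises);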

A possible improvement to squeeze out a bit more concurrency:

Your aim is to minimise JavaScript idle time (waiting for promises to resolve), and that you can do by maximising the number of promises that are pending.

For that reason it would be better to initiate a call to copyMediaFiles right after the corresponding mkdir promise resolves, rather than initiating the creation of all sibling dirs first before making that call:

const children = await readdir(node);
const promises = children.map(async child => {
    child = path.join(node, child);
    const courseDirs = child.split('/').slice(4, 7).join('/');
    await mkdir(path.join(targetDir, courseDirs), { recursive: true });
    await copyMediaFiles(child);
});
await Promise.all(promises);

With this code you will potentially start a copyMediaFiles call before all sibling directories have been created. If your directories have a high branching factor, then this means you will get a longer list of pending promises, which could be beneficial to the overall performance.

All depends on how well the underlying API manages concurrency.

trincot
  • "*which could be beneficial to the overall performance*" - or cause the exact opposite of the desired outcome due to bad cache locality. Depending on the device(s) copied from and to, the task may be ultimately limited by IO performance, which will be maxed out at some point and cannot be improved with more parallelisation. – Bergi May 04 '20 at 21:30
  • That's what I meant to convey with the last phrase of my answer. – trincot May 04 '20 at 21:33
  • @trincot Thanks for the feedback, now I understand a bit more. Btw, when I use Promise.all my laptop becomes really slow, so I assume doing this in series would be better (a for...of loop with await). What do you think? – wealmostdone May 05 '20 at 04:45
  • Yes, that alternative indeed removes the concurrency of the different API calls (a sketch of it follows these comments). If the API cannot manage such concurrency in a satisfactory manner, then that's the way to go – trincot May 05 '20 at 05:41
  • @Bergi is right. Disk performance often drops dramatically when you're reading/writing to above a fixed number of files. https://www.npmjs.com/package/bottleneck/v/2.9.0 can be a very good way to get a tradeoff between maximizing parallelism and limiting disk thrashing. – btilly May 05 '20 at 15:56
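
A minimal sketch of the sequential alternative mentioned in the comments above (a for...of loop with await, reusing the children array and copyMediaFiles from the question; this would replace the children.map / Promise.all pair inside an async function):

for (const child of children) {
    // each iteration waits for the previous one to finish,
    // so only one subtree is processed at a time
    await copyMediaFiles(child);
}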
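
And a rough sketch of the rate-limited middle ground using bottleneck, as suggested in the last comment (assuming bottleneck v2; the maxConcurrent value is an arbitrary assumption to tune, and src/dest stand in for the question's source and target paths):

const Bottleneck = require('bottleneck');
// allow at most 5 copy operations to be in flight at once
const limiter = new Bottleneck({ maxConcurrent: 5 });

// inside copyMediaFiles, instead of `await copy(src, dest)`:
await limiter.schedule(() => copy(src, dest));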