
I recently read about Node's `worker_threads` module, which allows parallel execution of JavaScript code in multiple threads and is useful for CPU-intensive operations. (Note: these are not the web workers that browsers like Chrome provide.)

I'm building a feature where I need to do a massive number of Postgres INSERTs without blocking the main thread.

The problem: in the JavaScript file where I instantiate the worker, I'm not allowed to import anything, including native Node modules or NPM libraries like Knex.js, which I need for the database queries. As soon as the file executes, I get the error `Cannot use import statement outside a module`.

I've tried putting the worker code in another file with an import statement at the top (same error). I've tried passing the Knex object through `workerData`, but it can't clone a non-native JS object.

I'm out of ideas. Does anyone know how to interact with a database in a worker thread if we can't import any NPM libraries?

```javascript
// mainThread.js

const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');

import knex from 'knex'; // --> *** UNCAUGHT EXCEPTION: Cannot use import statement outside a module ***

if (isMainThread) {
  module.exports = async function runWorker (rowsToInsert = []) {
    return new Promise((resolve, reject) => {
      const worker = new Worker(__filename, { workerData: { rowsToInsert } });

      worker.on('message', (returningRows) => resolve(returningRows));
      worker.on('error', reject);
      worker.on('exit', (code) => {
        if (code !== 0) reject(new Error(`Worker stopped with exit code ${code}`));
      });
    });
  };
} else {
  const { rowsToInsert } = workerData;

  return knex('table').insert(rowsToInsert)
    .then((returningRows) => {
      parentPort.postMessage({ data: returningRows });
    });
}
```

I am following a tutorial from this webpage: https://blog.logrocket.com/use-cases-for-node-workers/

Bruce Wang
    Yes, you can use workerThreads for that, but it is probably not required or useful. The database is, by itself, its own process. So, all the nodejs process is doing is sending it network requests in a non-blocking, asynchronous fashion. nodejs is unlikely to be doing anything CPU intensive that would benefit from workerThreads. – jfriend00 Jul 15 '20 at 23:08
  • 2
    FYI, when writing workerThread code I find it 200% cleaner and clearer and easier to debug if you put your workerThread code in its own file and do not mix main thread code with workerThread code in the same file. You should not need `if (isMainThread) {}` when doing it that way. Plus, the whole intellectual understanding of what code runs where is a lot simpler to follow when you separate them cleanly into two files. – jfriend00 Jul 15 '20 at 23:10
  • Thanks for this explanation! Makes sense to me! @jfriend00 – Bruce Wang Jul 16 '20 at 16:33
  • @jfriend00 By the way, if you have any other insight: what other tools are at my disposal when trying to do a high-volume insert using Node (e.g. up to 1 million rows per ingestion)? – Bruce Wang Jul 16 '20 at 16:57
  • 1
    nodejs is not your challenge there. The challenge will be doing 1 million inserts into a database. The database will be the bottleneck there, not nodejs. For further help on that you'd probably want to make a new question about the specific database you're using and what exactly you're trying to insert and then maybe someone who really knows that database can talk about the most efficient ways to get all that data into the database. Again, it's not really a nodejs issue, but how to optimally use the database. – jfriend00 Jul 16 '20 at 21:50

2 Answers


It is of course possible, but it's a very bad idea.

Database drivers are already asynchronous and non-blocking of the JavaScript thread. Moving your insert calls to a separate thread as you propose will not only get you no performance gains, it will actually decrease overall performance because of the overhead involved with interthread communication:

  • Synchronization and message passing are not free
  • JavaScript uses structured cloning when moving data between threads. This means all your rowsToInsert must be copied, which is (relatively) expensive.
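The copying in the second bullet can be observed directly with `structuredClone` (a global since Node 17), which implements the same structured clone algorithm that `worker_threads` applies to `workerData` and `postMessage` payloads:

```javascript
// structuredClone uses the same algorithm that worker_threads applies to
// workerData and postMessage payloads.
const rowsToInsert = [{ id: 1, name: 'alice' }, { id: 2, name: 'bob' }];

const copy = structuredClone(rowsToInsert);

// The clone is a deep copy: every row object was duplicated, which is
// exactly the per-row cost paid when handing data to a worker thread.
copy[0].name = 'mutated';
console.log(rowsToInsert[0].name); // 'alice' -- the original is untouched
console.log(copy[0].name);         // 'mutated'
```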

Generally, the only time it's really appropriate to use JS threads is when your JavaScript code is performing CPU-intensive work. The node docs say as much right at the top:

> Workers (threads) are useful for performing CPU-intensive JavaScript operations. They will not help much with I/O-intensive work. Node.js's built-in asynchronous I/O operations are more efficient than Workers can be.

This means if you're doing a lot of parsing, math, or similar, it may be appropriate to do the work in a thread. However, simply shoveling data from one place to another (i.e., I/O) is not a good candidate for a thread; after all, node's design is tuned to be efficient at this kind of work.

You don't say where your `rowsToInsert` come from, but if they're coming in from HTTP request(s), a thread is the wrong thing to use. However, if you're parsing, for example, a CSV or JSON file on the server, it may be worthwhile to do that in a thread, but it's important that the thread does all the work (so memory need not be moved between threads). The message you post to the worker should just be "process the file located at /foo/bar.csv", and then the worker thread does the rest.


The error you're getting is the same one you'd get without worker threads: you're trying to use `import` in a regular, non-module JS file. Either rename the worker file to `*.mjs` or use `require('knex')` instead.

Node's ES module documentation goes into detail about how `import` differs from `require`.

josh3736
  • Thanks for this amazing answer! It's super clear and your tip to use `require` worked. So if I wanted to do a high-volume insert, what tools are available? Should I use Node clusters? I'm googling "high volume inserts node" or "node cluster postgres inserts" but not getting a clear answer. – Bruce Wang Jul 16 '20 at 16:54
  • It depends on where your data comes from. Where does it come from? – josh3736 Jul 16 '20 at 18:51
  • The data comes from parsing multiple CSV files in Node. I iterate over the data and create an array of JS objects for insertion. Currently I'm chunking the inserts by 10K rows each chunk. And I'm running the Promises to insert them into Postgres one after another (as opposed to at the same time with Promise.all) – Bruce Wang Jul 21 '20 at 21:44
  • 1
    @BruceWang: If your input data is already well-formed, your best bet might be [`COPY`](https://www.postgresql.org/docs/current/sql-copy.html), which natively imports CSV to a table. [pg-copy-streams](https://github.com/brianc/node-pg-copy-streams) will help. However, if you're doing a lot of parsing and transformation in node, then it *might* make sense to do the job in a worker thread or process, but I'd profile to find the real bottleneck(s) first. Also look in to [prepared statements](https://node-postgres.com/features/queries#prepared-statements). – josh3736 Jul 21 '20 at 22:53
  • 1
    …and generally, read [the pg docs' performance tips](https://www.postgresql.org/docs/12/populate.html). – josh3736 Jul 21 '20 at 22:56
  • I think the answer could be a bit more nuanced. Unless what you're doing is trivial, there's usually some CPU usage involved before inserting something, to validate and bring the data into the right shape. Parsing CSV is not that fast. Also sometimes, you might want to isolate your connection pools from one another, so the big bulk import ones don't monopolize your regular pool used by http endpoints. – AlexG Oct 20 '21 at 17:23
  • @AlexG: I think the answer already conveys the nuance. Reread it, particularly paragraphs 3 and 4, which specifically mention parsing things like CSVs. Re connection isolation: if that's a concern, you're probably better off isolating your HTTP server and bulk jobs not with threads but in different processes (and probably even different machines), so that a crash caused by the bulk job doesn't bring the HTTP server down with it. – josh3736 Oct 20 '21 at 21:06

> Cannot use import statement outside a module

This is just a complaint that your code is using the `import` style of imports, but `import` is only supported inside ES modules, and your code is in a CommonJS context. In CommonJS code you need to use `require()` instead:

```javascript
const knex = require('knex');
```
thenickdude