I'm a Node.js newbie and was wondering which way is better for inserting a huge number of rows into a DB. On the surface, inserting rows one at a time looks like the way to go, because I can free the event loop quickly and serve other requests. But the code is harder to understand that way. For bulk inserts, I'd have to prepare the data beforehand, which means using loops. That would cause fewer requests to be served during that period, since the event loop is busy with the loop.

So, what's the preferred way? Is my analysis correct?

user2103008

2 Answers

There's no right answer here. It depends on the details: why are you inserting a huge number of rows? How often? Is this just a one-time bootstrap or does your app do this every 10 seconds? It also matters what compute/IO resources are available. Is your app the only thing using the database or is blasting it with requests going to be a denial of service for other users?

Without the details, my rule of thumb would be a bulk insert with a small concurrency limit: fire off up to 10 inserts, then wait until one of them finishes before sending another insert command to the database. This follows the model of async.eachLimit. It's how browsers handle concurrent requests to a given web site, and it has proven to be a reasonable default policy.
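A minimal sketch of that policy, assuming the async library and a callback-style driver connection (the `db` handle, table, and columns below are placeholders, not anything specific):

```js
const async = require('async');

// Assumption: `db` is an already-connected, callback-style driver (e.g. the
// mysql package), and `rows` is the data you've already prepared in memory.
const rows = [
  { name: 'widget', qty: 3 },
  { name: 'gadget', qty: 7 }
  // ...
];

function insertRow(row, callback) {
  db.query('INSERT INTO items (name, qty) VALUES (?, ?)', [row.name, row.qty], callback);
}

// At most 10 inserts are in flight at any time; as soon as one finishes,
// the next row is sent. The final callback fires when everything is done
// (or as soon as one insert errors out).
async.eachLimit(rows, 10, insertRow, function (err) {
  if (err) return console.error('an insert failed:', err);
  console.log('all rows inserted');
});
```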

Peter Lyons

In general, loops on in-memory objects should be fast, very fast.

I know you're worried about blocking the CPU, but you should be considering the total amount of work to be done. Sending items one at a time carries a lot of overhead. Each query to the DB has its own sequence of inner for loops that probably makes your "batching" for loop look pretty small.

If you need to dump 1,000 things into the DB, the minimum amount of work is to run it all at once. If you split it into 10 batches of 100 things, you have to do all of the same work plus generate and track all of those extra requests.
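As a rough sketch, with a driver that supports multi-row inserts, the whole batch can go out as one statement. The nested-array `VALUES ?` form below is the mysql/mysql2 style; `db`, the table, and the columns are placeholders:

```js
// Assumption: `things` is your 1000 in-memory objects and `db` is a
// mysql/mysql2 connection; the nested array passed for `VALUES ?` expands
// into a single multi-row insert.
const values = things.map(function (t) { return [t.name, t.qty]; });

// One statement, one round trip for the whole batch.
db.query('INSERT INTO items (name, qty) VALUES ?', [values], function (err, result) {
  if (err) return console.error('bulk insert failed:', err);
  console.log('inserted', result.affectedRows, 'rows');
});
```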

So how often are you doing these bulk inserts? If this is a regular occurrence, you probably want to minimize the total amount of work and bulk insert everything at once.

The trade-off here is logging and retries. It's usually not enough to just perform some type of bulk insert and forget about it. The bulk insert is eventually going to fail (fully or partially) and you will need some type of logic for retries or consolidation.

If that's a concern, you probably want to manage the size of the bulk insert so that you can retry blocks intelligently.
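One way to sketch that, reusing the same placeholder `db` and the async library: split the prepared data into fixed-size chunks, retry each chunk a few times, and log the chunks that still fail so you can deal with them later.

```js
// Split the prepared rows into chunks so a failure only costs you one chunk,
// not the whole load. The chunk size and retry count here are arbitrary.
function chunk(arr, size) {
  const out = [];
  for (let i = 0; i < arr.length; i += size) {
    out.push(arr.slice(i, i + size));
  }
  return out;
}

async.eachSeries(chunk(things, 100), function (batch, done) {
  const values = batch.map(function (t) { return [t.name, t.qty]; });
  // Retry this chunk up to 3 times before giving up on it.
  async.retry(3, function (cb) {
    db.query('INSERT INTO items (name, qty) VALUES ?', [values], cb);
  }, function (err) {
    if (err) console.error('chunk failed after retries, log it for later:', err);
    done(); // move on to the next chunk either way
  });
}, function () {
  console.log('all chunks attempted');
});
```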

Gates VP
  • Well, I don't do it that often. I'll go with bulk inserts for now. What do you mean bulk inserts will eventually fail? – user2103008 Sep 03 '13 at 19:22
  • Eventually you will attempt a bulk insert and it will fail for some reason: bad data, network failure, server error, etc. In the case of a bulk insert you need to be extra aware, because you need to identify which data succeeded and which data failed. For some databases it's "all or none"; for others you may get a partial success, meaning that your next update would "fill in the blanks". Some DBs have "INSERT OR MERGE" semantics so that you can safely retry the same batch if part of it fails. These would be _my_ big concerns. – Gates VP Sep 04 '13 at 19:35
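A rough sketch of those "INSERT OR MERGE" semantics in MySQL form (`ON DUPLICATE KEY UPDATE`; Postgres has `INSERT ... ON CONFLICT`), again with the placeholder `db` and a nested `values` array as in the earlier sketches, this time with the key column included:

```js
// Assumption: `id` is the table's primary (or unique) key. Re-sending a batch
// that partially succeeded just overwrites the rows that already made it in,
// so the whole batch is safe to retry.
const sql =
  'INSERT INTO items (id, name, qty) VALUES ? ' +
  'ON DUPLICATE KEY UPDATE name = VALUES(name), qty = VALUES(qty)';

db.query(sql, [values], function (err) {
  if (err) return console.error('retry the whole batch later:', err);
  console.log('batch applied (rows inserted or updated)');
});
```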