
When I try to update 5,000 rows using knexjs, I get the error "Timeout acquiring a connection. The pool is probably full."

When I looked at CPU usage, I found the postgres pids are always at 90-98% CPU, which is not normal. I tried calling destroy() after every knex query, but that destroys the connection and didn't solve the problem.

This is the code I'm using:

const knexDb = knex({
  client: 'pg',
  connection: {
    host: '127.0.0.1',
    user: process.env.DB_USER,
    password: process.env.DB_PASSWORD,
    database: process.env.DB_DATABASE,
    port: process.env.DB_PORT
  }
});

arrayWith5ThousandObj.map(data => {
    knexDb('users').where({
      user: data.user,
    })
    .update({
      product: data.product
    })
    .catch(err => console.error('update user products', err))
})

This runs in a function that repeats every minute. I also tried .finally -> knexDb.destroy(), but that destroys the connection and then I get the error that a connection couldn't be acquired.

I want to update 5,000 rows constantly, or more like 10,000+, using knexjs, and I think PostgreSQL can handle this; otherwise, how would large websites run tens of thousands of queries every minute without a problem? The problem is not the server: it has 10 CPUs and 16 GB of RAM, so resources aren't an issue. When I stopped all running processes on the server except this app, the postgres pids used almost no CPU, so the problem is the large number of queries being issued. Is there a bulk update so I can update all 10,000+ rows at once using knexjs?

I have tried this solution recently:

return knexDb.transaction(trx => {
    const queries = [];
    arrayWith5ThousandObj.forEach(data => {
        const query = knexDb('users')
            .where({
              user: data.user,
            })
            .update({
                product: data.product,
            })
            .transacting(trx); // This makes every update be in the same transaction
        queries.push(query);
    });

    Promise.all(queries) // Once every query is written
        .then(trx.commit) // We try to execute all of them
        .catch(trx.rollback); // And rollback in case any of them goes wrong
});

But I get this error:

{ error: deadlock detected
   at Connection.parseE (/*********/connection.js:601:11)
   at Connection.parseMessage (/*********/connection.js:398:19)
   at Socket.<anonymous> (/**********/connection.js:120:22)
   at Socket.emit (events.js:189:13)
   at addChunk (_stream_readable.js:284:12)
   at readableAddChunk (_stream_readable.js:265:11)
   at Socket.Readable.push (_stream_readable.js:220:10)
   at TCP.onStreamRead [as onread] (internal/stream_base_commons.js:94:17)
 name: 'error',
 length: 340,
 severity: 'ERROR',
 code: '40P01',
 detail:
  'Process 9811 waits for ShareLock on transaction 443279355; blocked by process 9808.\nProcess 9808 waits for ShareLock on transaction 443279612; blocked by process 9811.',
 hint: 'See server log for query details.',
 position: undefined,
 internalPosition: undefined,
 internalQuery: undefined,
 where: 'while locking tuple (1799,4) in relation "users"',
 schema: undefined,
 table: undefined,
 column: undefined,
 dataType: undefined,
 constraint: undefined,
 file: 'deadlock.c',
 line: '1140',
 routine: 'DeadLockReport' }
sasha romanov
  • Sounds as if Node.js or Knex opens a new connection for each row. – a_horse_with_no_name Jun 17 '19 at 23:43
  • @a_horse_with_no_name So how do you suggest solving this issue? – sasha romanov Jun 17 '19 at 23:45
  • Bluebird map series to limit concurrent promises. You are currently starting thousands of connections essentially simultaneously. – Gangstead Jun 18 '19 at 00:42
  • @Gangstead Can you give me an example using the code above? – sasha romanov Jun 18 '19 at 00:50
  • @Gangstead I used a promise to resolve every update query after 1 second, but this isn't practical: if I'm going to update tens of thousands of rows it will take forever. Is there any way to do the queries without promises, so it updates all the rows in less than a second? – sasha romanov Jun 18 '19 at 03:34
  • Are you able to do the update in a single UPDATE statement? Where does arrayWith5ThousandObj come from? If you can't do it in a single UPDATE statement, you could try to insert arrayWith5ThousandObj into a temp or normal table and then do the update with SQL only. – Mika Sundland Jun 18 '19 at 07:15
  • @MikaS How do I make the update in a single UPDATE statement? I don't see any docs on how to do that with knexjs, and I tried searching but sadly got no results. Can you help me by providing an example of how to do multiple row updates in one single query? – sasha romanov Jun 18 '19 at 18:41

2 Answers


Using bluebird.map to control concurrency:

const Bluebird = require('bluebird');

knexDb.transaction((trx) => {
    Bluebird.map(arrayWith5ThousandObj, (data) => {
        return trx('users')
            .where({
                user: data.user,
            })
            .update({
                product: data.product,
            });
    }, { concurrency: 5 })
    .then(trx.commit)     // commit once every update has finished
    .catch(trx.rollback); // roll back if any update fails
})
.then(() => console.log('all done'));

In your initial solution you generate 5000 promises at once, which all try to connect to the database at the same time. This solution ensures there are at most X concurrent promises and does not use delays; you can fine-tune the number for your setup. Knex's default pool size is a maximum of 10 connections.
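
If the default pool is too small for the concurrency you pick, you can also raise the pool size in the knex config. A minimal sketch, reusing the connection settings from the question (the pool numbers are illustrative, not a recommendation):

const knexDb = knex({
  client: 'pg',
  connection: {
    host: '127.0.0.1',
    user: process.env.DB_USER,
    password: process.env.DB_PASSWORD,
    database: process.env.DB_DATABASE,
    port: process.env.DB_PORT
  },
  // illustrative values: keep max at least as large as the Bluebird concurrency
  pool: { min: 2, max: 20 }
});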

Gangstead
  • I get this "deadlock detected" error, and after a while I get the "pool is probably full" error. – sasha romanov Jun 18 '19 at 22:33
  • Can you explain to me what concurrency: 5 means and what its role is? – sasha romanov Jun 18 '19 at 22:45
  • I have tried your solution and it works, but now I see the postgreSQL pids at 98% CPU usage all the time, especially since this function loops every 2 minutes. The server CPUs are at 10-20%, which is normal, but the postgreSQL CPU is always high with this method. – sasha romanov Jun 19 '19 at 03:23
  • With ten thousand plus individual updates every 2 minutes there's going to be some high CPU usage. As MikaS said, you'll have to do it in a single (knex.raw(...)) update statement or bulk insert into a temp table and run an update off of that (see the sketch after these comments). Possible starting point for the bulk update statement: https://stackoverflow.com/a/40555000/1637003 – Gangstead Jun 19 '19 at 15:42
  • I've created a table called temp_table and a function in psql that on INSERT updates the data in the real table ('users'), but I get this "deadlock detected" error. I stopped all running functions on this table, yet I still get this error from the psql function itself. How is that? – sasha romanov Jun 19 '19 at 18:12
  • OK, now the CPU is surging to 100% and it takes forever to update even 10,000 rows. Is there a way to slice the mapped data 10 by 10 and update until they are all updated? I know this could take a lot of time, but at least CPU usage wouldn't be so high. A 10-core server should be able to update 10,000 rows; there must be something wrong here. – sasha romanov Jun 20 '19 at 06:12
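
For reference, a minimal sketch of the temp-table approach discussed in the comments above (this is not the asker's code; it assumes "user" and "product" are text columns and that staging 5,000 rows per run is acceptable):

async function bulkUpdateViaTempTable(knexDb, rows) {
  await knexDb.transaction(async (trx) => {
    // Temporary staging table, dropped automatically at commit.
    await trx.raw(
      'CREATE TEMP TABLE tmp_user_products ("user" text, product text) ON COMMIT DROP'
    );

    // One multi-row insert instead of thousands of single-row statements.
    await trx('tmp_user_products').insert(rows);

    // A single set-based UPDATE joining against the staged rows.
    await trx.raw(`
      UPDATE users
      SET product = t.product
      FROM tmp_user_products t
      WHERE users."user" = t."user"
    `);
  });
}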

Knex is not really the correct tool for this kind of huge batch update. Especially in the way you are using it, it is particularly non-performant.

When you initialize 5k query builders, all the builders are created, compiled and executed at the same time, but when using a transaction, all the queries are sent through a single connection.

So all the updates are sent serially to the DB server anyway, and there is zero concurrency on those updates.

So there are 5000 knex objects to compile and 5000 SQL queries with bindings to send to the DB driver, which buffers them and sends them one by one to the server.

That should not cause deadlocks though... so you probably have some other problem in your code too.

You could try using smaller batches in multiple transactions, if it doesn't matter that the whole data set isn't rolled back when a single query fails... Actually, I don't understand why this kind of data update needs a transaction at all, if single rows can simply be resent/logged when there is a problem with them.

My best advice would be to tune the batch size, connection pool size and connection limits on the database server to match the workload that you are pushing to the server.
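
As a rough sketch of the smaller-batches idea (the batch size of 500 is arbitrary; tune it against your pool size and server limits):

async function updateInBatches(knexDb, rows, batchSize = 500) {
  // Each batch gets its own transaction, so a failure only affects that batch
  // and locks are released between batches.
  for (let i = 0; i < rows.length; i += batchSize) {
    const batch = rows.slice(i, i + batchSize);
    await knexDb.transaction(async (trx) => {
      for (const data of batch) {
        await trx('users')
          .where({ user: data.user })
          .update({ product: data.product });
      }
    });
  }
}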

"see postgreSQL pids CPU usage 98% all the time"

If you are doing a huge number of updates through a single transaction, it is really improbable that this is what causes the high CPU usage. You should log in to the SQL server and check what kind of queries it is executing during that workload... Maybe you are accidentally running that same update code multiple times in parallel in different transactions, which would also explain the deadlock issues.
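
One way to do that check without leaving Node (a sketch using the knexDb instance from the question; reading pg_stat_activity needs sufficient privileges):

// List non-idle backends and what they are executing, to spot the same
// update batch running in several transactions at once.
knexDb.raw(`
  SELECT pid, state, xact_start, wait_event_type, wait_event, query
  FROM pg_stat_activity
  WHERE state <> 'idle'
  ORDER BY xact_start
`)
.then(res => console.table(res.rows))
.catch(err => console.error('pg_stat_activity check failed', err));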

Batch updates are pretty problematic with SQL, since a single UPDATE statement can only apply the same values to every row it matches. One way to run multiple different updates in a single query is to use CTE queries: https://www.postgresql.org/docs/current/queries-with.html

That way you can build a batch of update queries and add them as pre-queries to the main query (https://knexjs.org/#Builder-with). All of those queries then run as one atomic operation in the DB, so a transaction is not needed to ensure that either the whole batch or nothing goes in.
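
A minimal sketch of that CTE idea, built here with knex.raw rather than the .with() builder for brevity (column names "user" and "product" come from the question; note PostgreSQL does not allow the same row to be updated twice in one statement):

// Each per-row UPDATE becomes a data-modifying CTE, so the whole batch is
// sent and applied as one atomic statement.
function buildCteBatchUpdate(knexDb, rows) {
  const ctes = [];
  const bindings = [];

  rows.forEach((row, i) => {
    ctes.push(`q${i} AS (UPDATE users SET product = ? WHERE "user" = ?)`);
    bindings.push(row.product, row.user);
  });

  return knexDb.raw(`WITH ${ctes.join(', ')} SELECT 1`, bindings);
}

// Hypothetical usage:
// buildCteBatchUpdate(knexDb, arrayWith5ThousandObj)
//   .then(() => console.log('batch applied'))
//   .catch(err => console.error('batch failed', err));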

Mikael Lepistö