In my company, our application runs on NodeJS over multiple EC2 instances and one RDS database.
Our application needed some upgrades as some dependencies were already quite old, and one of the upgrades we did that called our attention was updating our database libraries: mysql (from 2.16.0 to 2.17.0), knex (from 0.12.2 to 0.19.1) and bookshelf (0.10.2 to 0.15.1).
After checking the changelog, no code changes were required, so we quickly managed to upload it to our staging server.
Suddenly our application went too slow. All data took several seconds to load, and our main user's dashboard, which loads in a few miliseconds on the same server, took about 30 seconds. After a few minutes, the whole application was completely unresponsive.
In order to check if the problem was only involved with the dependency upgrade, we've managed to downgrade these to the working versions, and the application went back to the normal speed. Upgraded again, slow again.
We've started analyzing through New Relic if there was something wrong on the RDS side. Nothing at all. No peaks, no high usage of CPU, no slow queries or anything else. Then we came to check the connection pooling and discovered that the knex version that works for us uses "generic-pool", whereas the new version uses "tarn".
So we started debugging the pool and found out that it gets filled of a specified query, freezes completely for some time then starts throwing the "TimeoutError: Knex: Timeout acquiring a connection. The pool is probably full" error.
But what's most interesting about the query that fills all the pools and freezes then is that it should not be generated at all (and is not generated when using the outdated version that does not face this problem).
In our application, we only do SELECT requests against the contacts table in two ocasions:
First, obviously, is when the users want to list their contacts:
let contacts = await Contacts.forge({ 'list_owner': udata.id }).fetchAll()
And secondly, when checking for contact match to tell whether some information should be visible to an specified user, depending on the information owner's privacy settings:
let checkContact = await Contacts.where({
list_owner: target_user,
contact: udata.id
}).fetch()
After several grepping, I can guarantee that there is no other place in our codebase that SELECTS from the contacts table. In our debugging, we found no undefined values, and our investigation showed that the query runs when the former code runs. But as you can see in the screenshot, the query knex runs has no conditions:
select `contacts`.* from `contacts`
We believe this is the reason why it fills the pools (as requesting every user's contact is quite a job), but at the same time, we cannot see why knex is running such query, as:
- No code changes has been done after knex upgrade
- The problem does not exist when using the old knex version (our production server is up and running with the outdated knex version)
- We do a LOT of caching using Redis (but whatever, the DB is not overloaded and the old Knex version works)
- If the issue was really that the conditions were missing, we could have discovered it before, as every user would see the same contacts list.
What could be causing such problem?