We are trying to connect to AWS DocumentDB in a lambda which is built on Express in Serverless. To do this we're using mongoose and a connection function that looks something like

import mongoose from 'mongoose';
import logger from './utils/logger';
import fs from 'fs';

const READYSTATE_CONNECTED = 1;

const mongoDB = process.env.MONGODB_URI;

const certificateFilePath = __dirname + '/rds-combined-ca-bundle.pem';
logger.info(`Loading certificate file from ${certificateFilePath}`);
let ca = [fs.readFileSync(certificateFilePath)];


logger.info('Connection is ' + mongoose.connection.readyState);
if (mongoose.connection.readyState !== READYSTATE_CONNECTED) {
    logger.info(`Connecting to mongo using env connection string ${mongoDB}`);
    mongoose.connect(mongoDB, { useNewUrlParser: true, useUnifiedTopology: true, checkServerIdentity: false, ssl: true, sslCA: ca }).catch((err) => {
        logger.error(`Unable to connect to mongoose due to ${err.reason}`);
        console.error(err);
    });
}
mongoose.Promise = global.Promise;
const db = mongoose.connection;

// eslint-disable-next-line no-console
db.on('error', console.error.bind(console, 'MongoDB connection error:'));

export default db;

The idea is to maintain a connection and reuse it, avoiding the expense of creating a new connection for each request that reaches the Lambda. For the most part this works fine, but every once in a while (perhaps twice a day) we see problems connecting to the database. It seems to crash the Lambda pretty hard, and we have to trigger a change on the Lambda to trick it into restarting our application, after which everything works fine again for another few hours. We run four identical environments, and production is the only one that experiences this problem. Production is busier than the other environments, but only by about 50%.
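For illustration only, the reuse idea described above is often written as a cached connection promise rather than a readyState check at import time. The sketch below is an assumption about how that could look, not the code we run: `createConnectionCache` is a hypothetical helper, and the stand-in connect function takes the place of a real `mongoose.connect(...)` call. The one behavior it adds is that a failed attempt clears the cache, so a later invocation can retry instead of being stuck with a rejected connection.

```javascript
// Hypothetical helper (not from the original code): wraps any connect
// function so that it is called at most once while the attempt is pending
// or has succeeded.
function createConnectionCache(connectFn) {
  // Lives as long as the returned function, so when this is created at
  // module scope it survives across warm invocations of the same container.
  let cached = null;

  return function getConnection() {
    if (cached === null) {
      // The Promise executor runs synchronously, so connectFn is invoked
      // right away and a synchronous throw becomes a rejection.
      cached = new Promise((resolve) => resolve(connectFn())).catch((err) => {
        cached = null; // forget the failed attempt so the next call retries
        throw err;
      });
    }
    return cached;
  };
}

// Usage sketch with a stand-in connect function; in the real module this
// would be something like () => mongoose.connect(uri, options).
const getConnection = createConnectionCache(async () => ({ connectedAt: Date.now() }));
getConnection().then((conn) => console.log('connected at', conn.connectedAt));
```

Each request handler would then await `getConnection()` before touching the database, instead of relying on the connection having been opened when the module was first loaded.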

The error looks like

2020-11-09T20:10:36.565Z d88c9b33-6b84-44cd-8c1d-297c6334aad5 ERROR MongooseServerSelectionError: connection timed out
at NativeConnection.Connection.openUri (/var/task/node_modules/mongoose/lib/connection.js:800:32)
at /var/task/node_modules/mongoose/lib/index.js:342:10
at /var/task/node_modules/mongoose/lib/helpers/promiseOrCallback.js:31:5
at new Promise (<anonymous>)
at promiseOrCallback (/var/task/node_modules/mongoose/lib/helpers/promiseOrCallback.js:30:10)
at Mongoose.connect (/var/task/node_modules/mongoose/lib/index.js:341:10)
at Object.<anonymous> (/var/task/src/mongoose.js:19:24)
at Module._compile (internal/modules/cjs/loader.js:1137:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:1157:10)
at Module.load (internal/modules/cjs/loader.js:985:32)
at Function.Module._load (internal/modules/cjs/loader.js:878:14)
at Module.require (internal/modules/cjs/loader.js:1025:19)
at require (internal/modules/cjs/helpers.js:72:18)
at Object.<anonymous> (/var/task/src/AppBuilder.js:17:1)
at Module._compile (internal/modules/cjs/loader.js:1137:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:1157:10) {
reason: TopologyDescription {
type: 'ReplicaSetNoPrimary',
setName: 'rs0',
maxSetVersion: null,
maxElectionId: null,
servers: Map {
'documentdbmasterinstance-xxxx.xxx.us-east-1.docdb.amazonaws.com:27017' => [ServerDescription],
'documentdbreplica1instance-xxxx.xxxx.us-east-1.docdb.amazonaws.com:27017' => [ServerDescription],
'documentdbreplica2instance-xxxx.xxxx.us-east-1.docdb.amazonaws.com:27017' => [ServerDescription]
},
stale: false,
compatible: true,
compatibilityError: null,
logicalSessionTimeoutMinutes: null,
heartbeatFrequencyMS: 10000,
localThresholdMS: 15,
commonWireVersion: 6
}

Thus far we've been unable to pinpoint any particular action which causes this. There does look to be a slight increase in connections to the database at the time but only to about 75 connections and we're running on r5.large which should allow 1700 connections so we're well off that limit.

I am unsure whether the mention of ReplicaSetNoPrimary in the error log is a red herring, but it doesn't seem to be mentioned anywhere in similar issue reports. I also doubt that the connection is really timing out: none of the Lambda invocations take more than 200ms.

I suppose the questions are:

  • Is there anything obvious in the connection code which would cause this?
  • Is there a better, more canonical way to establish and maintain connections in this express application turned lambda?
  • Is the ReplicaSetNoPrimary indicative that there is some issue with the documentdb electing a new primary or the primary being unreachable?
  • Any suggestions for more logging I could add to chase this down?
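On the logging question, one option that might help (a sketch under assumptions, not a confirmed fix) is to log the driver's server discovery and monitoring events, so that a switch to ReplicaSetNoPrimary shows up in the logs with a timestamp. The event names `topologyDescriptionChanged` and `serverHeartbeatFailed`, and the payload fields used below, are as I understand the Node MongoDB driver's monitoring events when `useUnifiedTopology` is enabled; how to reach the underlying client from mongoose (for example `mongoose.connection.getClient()`) depends on the mongoose version and should be checked. `attachTopologyLogging` is a hypothetical helper, and a plain EventEmitter stands in for the client here.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: subscribes a log function to topology-related events
// on anything that behaves like an EventEmitter (assumed: the MongoClient).
function attachTopologyLogging(client, log) {
  client.on('topologyDescriptionChanged', (event) => {
    const previous = event.previousDescription && event.previousDescription.type;
    const next = event.newDescription && event.newDescription.type;
    log(`topology changed: ${previous} -> ${next}`);
  });

  client.on('serverHeartbeatFailed', (event) => {
    log(`heartbeat failed for ${event.connectionId}: ${event.failure}`);
  });

  return client;
}

// Stand-in emitter so the sketch runs without a database.
const fakeClient = attachTopologyLogging(new EventEmitter(), console.log);
fakeClient.emit('topologyDescriptionChanged', {
  previousDescription: { type: 'ReplicaSetWithPrimary' },
  newDescription: { type: 'ReplicaSetNoPrimary' },
});
```

Correlating those log lines with the time of the MongooseServerSelectionError would at least show whether the driver lost sight of the primary before the failure or never saw it at all.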

Edit: Our connection strings look like

mongodb://redacted:redacted@prod-db.cluster-cvgzkbo26lzb.us-east-1.docdb.amazonaws.com:27017/database?ssl=true&ssl_ca_certs=rds-combined-ca-bundle.pem&replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false
stimms
  • try passing connectTimeoutMS and socketTimeoutMS options when connecting to DB – Nonik Nov 19 '20 at 18:38
  • @nonik are there recommendations on what values to set those to? I've just been using the defaults thus far. – stimms Nov 21 '20 at 22:37
  • here are connection options i pass
      {
        bufferCommands: false, // Disable mongoose buffering
        bufferMaxEntries: 0, // and MongoDB driver buffering
        useNewUrlParser: true,
        reconnectTries: Number.MAX_VALUE,
        reconnectInterval: 500,
        autoReconnect: true,
        poolSize: 15,
        socketTimeoutMS: 2000000,
        keepAlive: true,
        connectTimeoutMS: 45000,
        //socketTimeoutMS: 3000,
        ssl: true,
        family: 4, // Use IPv4, skip trying IPv6
        useUnifiedTopology: true,
        useCreateIndex: true
      }
    – Nonik Nov 23 '20 at 17:39
  • Did you manage to solve this? Having pretty much the same problem, except in my case I'm trying to connect from a NodeJS app. – philosopher Dec 05 '20 at 14:47
  • @philosopher I opened tickets with the Lambda and DocumentDB teams, both of whom were pretty useless and just sent me links from around the internet. In the end I created a MongoDB Atlas account, set up VPC peering and deleted the DocumentDB. Atlas has worked flawlessly, so I'm forced to believe that this is a problem with DocumentDB. I should blog this experience; if I do, I'll update here. – stimms Dec 21 '20 at 21:33

1 Answer

There could be multiple reasons for the connection timeout. The most common one is that the client's IP address has not been whitelisted, or that access to the database from where the client runs has not been enabled.

The other reason could be the protocol you're using. Please share the format of your connection string so I can check it and update my answer accordingly.