
I have a Lambda function triggered by an SQS FIFO queue whenever there are messages on the queue. The function reads each message and connects to QLDB through a VPC endpoint to run a simple SELECT query followed by an INSERT query. The table being queried has an index on the field used in the WHERE condition.

Flow (all the services are running "inside" a VPC):

SQS -> Lambda -> VPC interface endpoint -> QLDB

Query SELECT:

SELECT FIELD1, FIELD2 FROM TABLE1 WHERE FIELD3 = "ABCDE"

Query INSERT:

INSERT INTO TABLE1 .....

The Lambda uses a shared QLDB connection/session, and this is how I connect to it:

import { QldbDriver, RetryConfig } from 'amazon-qldb-driver-nodejs'

let driverQldb: QldbDriver
const ledgerName = 'MyLedger'

export function connectQLDB(): QldbDriver {
  if ( !driverQldb ) {
    const retryLimit = 4
    const retryConfig = new RetryConfig(retryLimit)
    const maxConcurrentTransactions = 1500
    driverQldb = new QldbDriver(ledgerName, {}, maxConcurrentTransactions, retryConfig)
  }
  return driverQldb
}

When I run a load test that sends around 200 requests/messages per second to this Lambda over a 15-minute interval, I start seeing random long executions while the queries run on QLDB (mainly the SELECT query). Sometimes the same query returns data in around 100 ms, and sometimes it takes more than 40 seconds, which results in Lambda timeouts. I have raised the Lambda timeout to 1 minute, but this is not the best approach, and sometimes even that is not enough.

The VPC endpoint metrics are showing around 250 active connections and 1000 new connections during this load test execution. Is there any QLDB metric that could help to identify the root cause of this behavior?

Could it be related to a QLDB limit (such as the 1500 active sessions described here: https://docs.aws.amazon.com/qldb/latest/developerguide/limits.html#limits.default) or to concurrent read/write I/O?

  • Are any errors being returned from the QLDB driver? QLDB's maximum transaction time is 30 seconds, so I'm wondering whether the queries are failing. Did you verify that your table has an index on the `FIELD3` column being filtered? Also, try setting the RetryConfig with a backoff function like the one in this example line; otherwise your code as presented will use the default backoff of up to 5 seconds. https://github.com/awslabs/amazon-qldb-driver-nodejs/blob/72b6657b436a41e6f14e2ef00a638d6ba7187433/src/integrationtest/SessionManagement.test.ts#L141 – bwinchester Sep 20 '22 at 14:04
  • Hi @bwinchester, thank you for your answer. No errors are being returned from the QLDB driver, and I do have an index on FIELD3. The long executions are pretty random. I was thinking about network overload on the VPC endpoint side, but I didn't find any such limit in the endpoint documentation. – Thiago Scodeler Sep 20 '22 at 14:19
  • Have you logged out IO and timing information? It might help you pinpoint networking or query specific latency. https://docs.aws.amazon.com/qldb/latest/developerguide/working.statement-stats.html After that, you might collect transaction IDs, ledger IDs, and contact AWS Support so they can help dig in with you. – bwinchester Sep 20 '22 at 16:08
  • No, I'm not logging IO/timing info. I'll take a look at this documentation. When checking CloudWatch for QLDB metrics, I'm getting this result during load test execution for read/write IO (Count statistic "Sum" in Period of 1 minute): WriteIO = 36,000 and ReadIO = 12,000 – Thiago Scodeler Sep 20 '22 at 16:32
  • Fast SELECT query: 2022-09-19T18:55:59.313Z 2a41b140-7fe1-5c7a-a19e-eae2c29d291b INFO -> Will get data from QLDB 2022-09-19T18:55:59.538Z 2a41b140-7fe1-5c7a-a19e-eae2c29d291b INFO -> data got from QLDB – Thiago Scodeler Sep 20 '22 at 20:25
  • Too long SELECT query: 2022-09-19T18:34:13.494Z 9abbd669-74f4-5765-b3ab-8dcbcc90f792 INFO -> Will get data from QLDB 2022-09-19T18:34:26.570Z 9abbd669-74f4-5765-b3ab-8dcbcc90f792 INFO -> data got from QLDB – Thiago Scodeler Sep 20 '22 at 20:26

1 Answer


scodeler, I've read through the NodeJS QLDB driver, and I think there's an order-of-operations error in its default backoff. If you provide your own backoff function in the RetryConfig, as in RetryConfig(4, newBackoffFunction), you should see a significant improvement in how quickly your Lambda invocations complete.

The driver's default backoff computes

const exponentialBackoff: number = Math.min(SLEEP_CAP_MS, Math.pow(SLEEP_BASE_MS * 2, retryAttempt));

and, in summary, returns

return Math.random() * exponentialBackoff;

This does not match the usual best practice for exponential backoff with jitter. A backoff function that does:

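// SLEEP_CAP_MS and SLEEP_BASE_MS are not defined in this snippet; declare them with the values you want.
const newBackoffFunction: BackoffFunction = (retryAttempt: number, error: Error, transactionId: string) => {
    // Exponential ceiling: base * 2^attempt, capped at SLEEP_CAP_MS.
    const exponentialBackoff: number = Math.min(SLEEP_CAP_MS, SLEEP_BASE_MS * Math.pow(2, retryAttempt));
    // Full jitter: pick a random delay between 0 and the ceiling.
    const jitterRand: number = Math.random();
    const delayTime: number = jitterRand * exponentialBackoff;
    return delayTime;
}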

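The difference is that SLEEP_BASE_MS should be multiplied by 2 ^ retryAttempt, rather than raising (SLEEP_BASE_MS x 2) to the power of retryAttempt. For any base above 1 ms, the second form grows much faster and reaches SLEEP_CAP_MS far sooner.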
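To make the gap concrete, here is a small standalone sketch that compares the two ceilings for the first few retries. The values SLEEP_BASE_MS = 10 and SLEEP_CAP_MS = 5000 are assumptions chosen for illustration (the 5-second cap matches the "default backoff of up to 5 seconds" mentioned in the comments); check the driver source for the values in your version. The helper names are hypothetical and not part of the driver.

```typescript
// Assumed constants for illustration only; verify against your driver version.
const SLEEP_BASE_MS = 10;
const SLEEP_CAP_MS = 5000;

// Ceiling used by the default backoff as quoted above: (base * 2) ^ attempt.
function defaultBackoffCeiling(retryAttempt: number): number {
  return Math.min(SLEEP_CAP_MS, Math.pow(SLEEP_BASE_MS * 2, retryAttempt));
}

// Ceiling used by the proposed backoff: base * 2 ^ attempt.
function proposedBackoffCeiling(retryAttempt: number): number {
  return Math.min(SLEEP_CAP_MS, SLEEP_BASE_MS * Math.pow(2, retryAttempt));
}

// Full jitter: a random delay in [0, ceiling). The random source is
// injectable so the function can be exercised deterministically.
function withFullJitter(ceilingMs: number, rand: () => number = Math.random): number {
  return rand() * ceilingMs;
}

// Print the maximum possible delay per retry attempt for both formulas.
for (let attempt = 1; attempt <= 4; attempt++) {
  console.log(
    `attempt ${attempt}: default <= ${defaultBackoffCeiling(attempt)} ms, ` +
    `proposed <= ${proposedBackoffCeiling(attempt)} ms`
  );
}
```

Under these assumed constants, the default ceilings for attempts 1 through 4 are 20, 400, 5000 and 5000 ms, while the proposed ones are 20, 40, 80 and 160 ms. With a retry limit of 4, a transaction that keeps hitting conflicts could therefore sleep for several seconds in total under the default formula, versus a fraction of a second with the proposed one.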

Hope this helps!

bwinchester