
I'm trying to batch-process reading a file and posting the records to a database. Currently, I am trying to batch it 20 records at a time, as seen below.

Despite the documentBatch.length check I have put in, it still isn't working: the database call inside persistToDB should be made 5 times, but for some reason it's only made once, and console-logging documentBatch.length shows it climbing past that limit. I suspect this is due to concurrency issues; however, persistToDB is from an external lib that needs to be called within an async function.

The way I am trying to batch is to pause the stream and resume it once the DB work is done; however, this still has the same issue.

  let documentBatch = [];
  const processedMetrics = {
    succesfullyProcessed: 0,
    unsuccesfullyProcessed: 0,
  };

  rl.on('line', async (line) => {
    try {
      const document = JSON.parse(line);
      documentBatch.push(document);
      console.log(documentBatch.length);
      if (documentBatch.length === 20) {
        rl.pause();
        const batchMetrics = await persistToDB(documentBatch);
        documentBatch = [];
        processedMetrics.succesfullyProcessed +=
          batchMetrics.succesfullyProcessed;
        processedMetrics.unsuccesfullyProcessed +=
          batchMetrics.unsuccesfullyProcessed;
        rl.resume();
      }
    } catch (e) {
      logger.error(`Failed to save document ${line}`);
      throw e;
    }
  });
  • If you have to pause the stream to process data, I think you don't need an async stream here at all. I would rather just read the data line by line, add each line to the batch, and then process the batch synchronously. You do not benefit from the async 'line' event, and its async nature pushes you toward adding workaround code. Here is a similar question with a similar conclusion: https://stackoverflow.com/questions/52725360/make-async-call-from-readlines-line-callback – RAllen Feb 06 '23 at 20:12
  • I would do that, but the batch could be up to 50k in size, which typically exceeds the memory limits for AWS Lambda (hence the batching). – Sheen Feb 07 '23 at 17:35
  • I meant that you could read line by line, and once you have collected 20 items, you persist them. You can do batching, but you don't need async reading via a stream in this scenario. It doesn't give you any advantage because the sole purpose of your process is to read line by line and persist batches. This is a synchronous scenario by nature (a sketch of this approach follows below). – RAllen Feb 07 '23 at 23:05
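
Following up on RAllen's suggestion, here is a minimal sketch of the sequential version using readline's built-in async iterator, assuming persistToDB accepts an array of parsed documents and resolves to the same metrics shape used in the question (error handling omitted for brevity; the processFile name and path parameter are illustrative):

const fs = require('fs');
const readline = require('readline');

async function processFile(path) {
  const rl = readline.createInterface({
    input: fs.createReadStream(path),
    crlfDelay: Infinity,
  });

  const processedMetrics = {
    succesfullyProcessed: 0,
    unsuccesfullyProcessed: 0,
  };
  let documentBatch = [];

  const flush = async () => {
    // persistToDB is the external library call from the question; it is
    // awaited here, so reading does not continue until the batch is saved.
    const batchMetrics = await persistToDB(documentBatch);
    documentBatch = [];
    processedMetrics.succesfullyProcessed +=
      batchMetrics.succesfullyProcessed;
    processedMetrics.unsuccesfullyProcessed +=
      batchMetrics.unsuccesfullyProcessed;
  };

  // The async iterator pulls lines one at a time: the next line is not
  // delivered until this iteration's body has completed, so no pause/resume
  // bookkeeping is needed and the batch cannot grow past 20.
  for await (const line of rl) {
    documentBatch.push(JSON.parse(line));
    if (documentBatch.length === 20) {
      await flush();
    }
  }

  // Persist the final partial batch (fewer than 20 documents) after EOF.
  if (documentBatch.length > 0) {
    await flush();
  }

  return processedMetrics;
}

Because at most one batch of 20 documents is held in memory at a time, this should also fit the Lambda memory constraint mentioned in the comments.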

0 Answers