
I have the following use case to solve. I need to ingest data from an S3 bucket using a Lambda function (NodeJS 12). The Lambda function will be triggered when a new file is created. The file is a gz archive and can contain multiple TSV (tab-separated) files. For each row, an API call will be triggered from the Lambda function. Questions:

1 - Does it have to be a two-step process: uncompress the archive into the /tmp folder and then read the TSV files? Or can you stream the content of the archive file directly?

2 - Do you have a snippet of code that you could share that shows how to stream a GZ file from an S3 bucket and read its content (TSV)? I've found a few examples, but only for pure NodeJS, not for Lambda/S3.

Thanks a lot for your help.

Adding a snippet of code from my first test; it doesn't work. No data is logged to the console:

const csv = require('csv-parser')
const aws = require('aws-sdk');
const s3 = new aws.S3();


exports.handler = async(event, context, callback) => {
    const bucket = event.Records[0].s3.bucket.name;
    const objectKey = event.Records[0].s3.object.key;
    const params = { Bucket: bucket, Key: objectKey };
    var results = [];

    console.log("My File: "+objectKey+"\n")
    console.log("My Bucket: "+bucket+"\n")


    var otherOptions = {
        columns: true,
        auto_parse: true,
        escape: '\\',
        trim: true,
    };

    s3.getObject(params).createReadStream()
        .pipe(csv({ separator: '|' }))
        .on('data', (data) => results.push(data))
        .on('end', () => {
            console.log("My data: "+results);
        });

    return await results
};


Cyrillou
  • "Do you have a snippet of code that you could share for this part of the function?" can you define "this part"? – MyStackRunnethOver Nov 21 '19 at 00:04
  • Hi @MyStackRunnethOver. Thanks for getting back to me. I meant for streaming the content of the TSV files which are in the GZ archive. – Cyrillou Nov 21 '19 at 00:10

2 Answers


You may want to take a look at the Wikipedia article on gzip:

Although its file format also allows for multiple [compressed files / data streams] to be concatenated (gzipped files are simply decompressed concatenated as if they were originally one file[5]), gzip is normally used to compress just single files. Compressed archives are typically created by assembling collections of files into a single tar archive (also called tarball), and then compressing that archive with gzip. The final compressed file usually has the extension .tar.gz or .tgz.

What this means is that by itself, gzip (or a Node package wrapping it) is not powerful enough to decompress a single .gz file into multiple files. Hopefully, if a single .gz item in S3 contains more than one file, it's actually a .tar.gz or similar compressed collection. To deal with these, check out

Simplest way to download and unzip files in NodeJS

You may also be interested in node-tar.

In terms of getting just one file out of the archive at a time, this depends on what the compressed collection actually is. Some compression schemes allow extracting just one file at a time; others don't (they require you to decompress the whole thing in one go). Tar does the former; a rough sketch of the streaming approach follows below.
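
If the object really is a tar archive compressed with gzip, the whole pipeline can be streamed without touching /tmp: pipe the S3 read stream through zlib.createGunzip() and into node-tar's Parse class, which emits one entry per contained file. A minimal sketch, assuming aws-sdk v2, node-tar, and csv-parser, with the per-row API call left as a placeholder:

const aws = require('aws-sdk');
const zlib = require('zlib');
const tar = require('tar');        // node-tar
const csv = require('csv-parser');

const s3 = new aws.S3();

exports.handler = async (event) => {
    const bucket = event.Records[0].s3.bucket.name;
    const key = event.Records[0].s3.object.key;

    // Wrap the pipeline in a Promise so the async handler does not
    // return before the whole archive has been consumed.
    await new Promise((resolve, reject) => {
        s3.getObject({ Bucket: bucket, Key: key })
            .createReadStream()            // stream straight from S3
            .pipe(zlib.createGunzip())     // strip the gzip layer
            .pipe(new tar.Parse())         // emits one 'entry' per file
            .on('entry', (entry) => {
                // each entry is itself a readable stream: one TSV file
                entry.pipe(csv({ separator: '\t' }))
                    .on('data', (row) => {
                        // the real per-row API call would go here; it
                        // would need to be awaited or collected as well
                        console.log(row);
                    });
            })
            .on('error', reject)
            .on('end', resolve);
    });
};

If the object is instead a bare .gz holding a single TSV, the tar.Parse step can be dropped and the gunzip output piped straight into csv-parser.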

MyStackRunnethOver

The first step should be to decompress the .tar.gz file, using the decompress package:

// TypeScript code for decompressing a .tar.gz file fetched from S3
import * as AWS from "aws-sdk";
const decompress = require("decompress");

const s3 = new AWS.S3();

try {
  // getObject(...).promise() resolves to the full response;
  // the archive bytes are in the Body buffer
  const targzSrc = await s3
    .getObject({
      Bucket: BUCKET_NAME,
      Key: fileRequest.key
    })
    .promise();

  // decompress accepts a Buffer and resolves to the extracted files
  const filesPromise = decompress(targzSrc.Body as Buffer);
  const outputFileAsString = await filesPromise.then((files: any[]) => {
    console.log("inflated file:", files[0].data.toString("utf-8"));
    return files[0].data.toString("utf-8");
  });

  console.log("And here goes the file content:", outputFileAsString);

  // here should be the code that parses the CSV content using the outputFileAsString

} catch (err) {
  console.log("G_ERR:", err);
}
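
For the parsing placeholder above, one option is to feed the decompressed string back through csv-parser (the package from the question) via a readable stream. A hedged sketch in plain Node style, assuming tab-separated rows with a header line:

const { Readable } = require("stream");
const csvParser = require("csv-parser");

// Readable.from turns the in-memory string into a stream that
// csv-parser can consume row by row
const rows = [];
await new Promise((resolve, reject) => {
  Readable.from([outputFileAsString])
    .pipe(csvParser({ separator: "\t" }))
    .on("data", (row) => rows.push(row))
    .on("end", resolve)
    .on("error", reject);
});
console.log("parsed rows:", rows.length);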
vencedor