41

I need to create a Zip file made up of a selection of files (videos and images) located in my S3 bucket.

The problem at the moment, using my code below, is that I quickly hit Lambda's memory limit.

// (excerpt) assumes the async library, an aws-sdk S3 client (s3) and a zip instance (zip, presumably JSZip) are already set up
async.eachLimit(files, 10, function(file, next) {
    var params = {
        Bucket: bucket, // bucket name
        Key: file.key
    };
    s3.getObject(params, function(err, data) {
        if (err) {
            console.log('file', file.key);
            console.log('get image files err', err, err.stack); // an error occurred
            next(err);
        } else {
            console.log('file', file.key);
            zip.file(file.key, data.Body); // buffers the whole object in memory
            next();
        }
    });
},
function(err) {
    if (err) {
        console.log('err', err);
    } else {
        console.log('zip', zip);
        var content = zip.generateNodeStream({
            type: 'nodebuffer',
            streamFiles: true
        });
        var params = {
            Bucket: bucket, // name of dest bucket
            Key: 'zipped/images.zip',
            Body: content
        };
        s3.upload(params, function(err, data) {
            if (err) {
                console.log('upload zip to s3 err', err, err.stack); // an error occurred
            } else {
                console.log(data); // successful response
            }
        });
    }
});
  • Is this possible using Lambda, or should I look at a different approach?

  • Is it possible to write to a compressed zip file on the fly, therefore eliminating the memory issue somewhat, or do I need to have the files collected before compression?

Any help would be much appreciated.

alex
Rabona

6 Answers

55

Okay, I got to do this today and it works: direct buffer to stream, no disk involved, so neither the memory nor the disk limitation is an issue here:

'use strict';

const AWS = require("aws-sdk");
AWS.config.update( { region: "eu-west-1" } );
const s3 = new AWS.S3( { apiVersion: '2006-03-01'} );

const   _archiver = require('archiver');

//This returns us a stream.. consider it as a real pipe sending fluid to S3 bucket.. Don't forget it
const streamTo = (_bucket, _key) => {
 var stream = require('stream');
 var _pass = new stream.PassThrough();
 s3.upload( { Bucket: _bucket, Key: _key, Body: _pass }, (_err, _data) => { /*...Handle Errors Here*/ } );
 return _pass;
};
      
exports.handler = async (_req, _ctx, _cb) => {
 var _keys = ['list of your file keys in s3'];
 
    var _list = await Promise.all(_keys.map(_key => new Promise((_resolve, _reject) => {
            s3.getObject({ Bucket: 'bucket-name', Key: _key }).promise()
                .then(_data => _resolve( { data: _data.Body, name: `${_key.split('/').pop()}` } ))
                .catch(_reject);
        }
    ))).catch(_err => { throw new Error(_err) } );

    await new Promise((_resolve, _reject) => { 
        var _myStream = streamTo('bucket-name', 'fileName.zip');  //Now we instantiate that pipe...
        var _archive = _archiver('zip');
        _archive.on('error', _err => _reject(_err) );  //Rejecting here (instead of throwing) lets the surrounding promise fail properly
        
        //Your promise gets resolved when the fluid stops running... so that's when you get to close and resolve
        _myStream.on('close', _resolve);
        _myStream.on('end', _resolve);
        _myStream.on('error', _reject);
        
        _archive.pipe(_myStream);   //Pass that pipe to _archive so it can push the fluid straight down to the S3 bucket
        _list.forEach(_itm => _archive.append(_itm.data, { name: _itm.name } ) );  //And then we start adding files to it
        _archive.finalize();    //Tell it that's all we want to add. When it finishes, the promise will resolve in one of those events up there
    }).catch(_err => { throw new Error(_err) } );
    
    _cb(null, { } );  //Handle response back to server
};
iocoker
  • Would it be possible to stream the zip back as the response, directly to the "user"? – bobmoff Jan 07 '19 at 23:12
  • @bobmoff, with Lambda, NO, because we don't have direct access to the HTTP response stream. In a regular Node.js environment, YES, that's possible. Checking the Node.js docs, **HTTP requests and responses** are part of the Writable Streams objects: [link](https://nodejs.org/api/stream.html#stream_class_stream_writable), just like zlib, fs, etc., but we don't have direct access to that in Lambda; we only use the callback function to communicate back to the client. – iocoker Jan 10 '19 at 04:54
  • I just noticed that you don't stream the objects from S3, which could cause memory problems if downloading many files, right? I am not sure what the .get method does in your example, as it doesn't exist in aws-sdk. Guess you mean getObject. But anyway, as you are waiting for all the downloaded objects to finish, that means all those objects have to live in memory at the same time. Lambda has limited memory, so I guess it would have to use .createReadStream instead and append those to the archiver, or am I missing something? – bobmoff Jan 10 '19 at 23:03
  • @bobmoff That's correct, sorry; we have a wrapper for all S3 operations. It should be getObject and you should pass in a return function. – iocoker Jan 11 '19 at 09:54
  • @bobmoff As regards the memory issue, we just cranked the memory for that function up to around 3008MB. But you are right, streaming the objects one at a time will save more memory compared to buffering them all in the list. I'll try that and update the code. – iocoker Jan 11 '19 at 10:02
  • I just tried it using s3.getObject().createReadStream() and the memory usage tops out at around 170mb for my test of 100 files of around 3-10mb each. Without the stream it goes up to 730mb. It is also faster, as it doesn't have to wait for all the objects to be downloaded before starting the stream to the bucket. – bobmoff Jan 11 '19 at 10:17
  • After implementing this solution I got to the point where too many streams were opened and I received a TimeoutError. For those who run into this, please read this [issue](https://github.com/aws/aws-sdk-js/issues/2087) to fully understand the problem and the possible solutions. – Dimas Crocco Mar 17 '20 at 19:50
  • Is this approach possible in C#/.NET? – LP13 Nov 24 '20 at 18:39
  • @bobmoff, could you suggest an update to this answer with your code including `s3.getObject().createReadStream()`? I think that would be helpful to future viewers. In the interim, check out this gist: https://gist.github.com/amiantos/16bacc9ed742c91151fcf1a41012445e – Ben Stickley Oct 11 '22 at 20:28
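A minimal sketch of the createReadStream variant described in the comments above (not part of the original answer), reusing the streamTo helper, _archiver and _keys from the code block up top; with very many files, keep the TimeoutError issue linked above in mind:

// Hedged sketch only: append lazy S3 read streams instead of pre-buffered bodies
await new Promise((_resolve, _reject) => {
    const _myStream = streamTo('bucket-name', 'fileName.zip');  // PassThrough piped to s3.upload, as above
    const _archive = _archiver('zip');

    _archive.on('error', _reject);
    _myStream.on('close', _resolve);
    _myStream.on('error', _reject);

    _archive.pipe(_myStream);
    _keys.forEach(_key => {
        // Each object is downloaded as archiver drains its stream,
        // so whole files are never buffered in memory at once
        const _s3Stream = s3.getObject({ Bucket: 'bucket-name', Key: _key }).createReadStream();
        _archive.append(_s3Stream, { name: _key.split('/').pop() });
    });
    _archive.finalize();
});
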
13

I formatted the code following @iocoker's answer.

main entry

// index.js

'use strict';
const S3Zip = require('./s3-zip')

const params = {
  files: [
    {
      fileName: '1.jpg',
      key: 'key1.JPG'
    },
    {
      fileName: '2.jpg',
      key: 'key2.JPG'
    }
  ],
  zippedFileKey: 'zipped-file-key.zip'
}

exports.handler = async event => {
  const s3Zip = new S3Zip(params);
  await s3Zip.process();

  return {
    statusCode: 200,
    body: JSON.stringify(
      {
        message: 'Zip file created successfully!'
      }
    )
  };

}

Zip file util

// s3-zip.js

'use strict';
const AWS = require("aws-sdk");

const Archiver = require('archiver');
const Stream = require('stream');

const https = require('https');
const sslAgent = new https.Agent({
  keepAlive: true,
  rejectUnauthorized: true
});
sslAgent.setMaxListeners(0);
AWS.config.update({
  httpOptions: {
    agent: sslAgent,
  },
  region: 'us-east-1'
});

module.exports = class S3Zip {
  constructor(params, bucketName = 'default-bucket') {
    this.params = params;
    this.BucketName = bucketName;
  }

  async process() {
    const { params, BucketName } = this;
    const s3 = new AWS.S3({ apiVersion: '2006-03-01', params: { Bucket: BucketName } });

    // create readstreams for all the output files and store them
    const s3FileDwnldStreams = params.files.map(item => {
      const stream = s3.getObject({ Key: item.key }).createReadStream();
      return {
        stream,
        fileName: item.fileName
      }
    });

    const streamPassThrough = new Stream.PassThrough();
    // Create a zip archive using streamPassThrough style for the linking request in s3bucket
    const uploadParams = {
      ACL: 'private',
      Body: streamPassThrough,
      ContentType: 'application/zip',
      Key: params.zippedFileKey
    };

    const s3Upload = s3.upload(uploadParams, (err, data) => {
      if (err) {
        console.error('upload err', err)
      } else {
        console.log('upload data', data);
      }
    });

    s3Upload.on('httpUploadProgress', progress => {
      // console.log(progress); // { loaded: 4915, total: 192915, part: 1, key: 'foo.jpg' }
    });

    // create the archiver
    const archive = Archiver('zip', {
      zlib: { level: 0 }
    });
    archive.on('error', (error) => {
      throw new Error(`${error.name} ${error.code} ${error.message} ${error.path} ${error.stack}`);
    });

    // connect the archiver to upload streamPassThrough and pipe all the download streams to it
    await new Promise((resolve, reject) => {
      console.log("Starting upload of the output Files Zip Archive");

      streamPassThrough.on('close', resolve);
      streamPassThrough.on('end', resolve);
      streamPassThrough.on('error', reject);

      archive.pipe(streamPassThrough);
      s3FileDwnldStreams.forEach((s3FileDwnldStream) => {
        archive.append(s3FileDwnldStream.stream, { name: s3FileDwnldStream.fileName })
      });
      archive.finalize();

    }).catch((error) => {
      throw new Error(`${error.code} ${error.message} ${error.data}`);
    });

    // Finally wait for the uploader to finish
    await s3Upload.promise();

  }
}
OhadR
zhenqi li
    Works like a charm. Thank you so much Zhenqi Li. I found this answer very well structured and easier to understand than the other answers here. – Hardik Shah Jun 10 '21 at 08:26
  • @HardikShah You are welcome. I'm glad to be able to help. – zhenqi li Jun 23 '21 at 07:28
  • Update: this works well for zipping 600-700 files (all small, ~2-3 MB each), but as soon as the count of files goes beyond that, the zip is not created and no error is logged. Any idea what could be going wrong? – Hardik Shah Jun 24 '21 at 04:59
  • There is nothing wrong on my side, even when a single file reaches 7MB. My general configuration is: Memory (3008MB), Timeout (15min). You can check your general configuration or debug the code. @HardikShah – zhenqi li Jun 29 '21 at 13:32
  • @zhenqili, I have created a Lambda function with this code and configuration, but the function throws this error: "Runtime.ImportModuleError: Error: Cannot find module 'archiver'". Could you please help resolve it? – Saurabh Jun 30 '21 at 04:57
  • @Saurabh This is not complete code. You should install the dependencies and upload the node_modules. – zhenqi li Jun 30 '21 at 06:47
  • For some reason, on the line `s3Upload.on('close', resolve()); ` I get: `Argument of type '"end"' is not assignable to parameter of type '"httpUploadProgress"'` – OhadR Jul 07 '21 at 11:12
  • Fairly new to the Lambda world; I tried copying the code into a Lambda function and it throws a timeout error. How can we make the code above work? Are there any other files associated with it? – Praveen Govind Jul 08 '21 at 00:59
  • In my case the event triggered after the archive finalized was streamPassThrough.on('finish', resolve()); – Cristiano Sarmento Jan 18 '23 at 14:45
8

The other solutions work well when there aren't too many files (fewer than ~60). With more files they just quit into nothing, with no errors, because they open too many streams at once.

This solution is inspired by https://gist.github.com/amiantos/16bacc9ed742c91151fcf1a41012445e

It is a working solution that works well even with many files (300+) and returns a presigned URL to the zip containing them.

Main Lambda:

import AWS from 'aws-sdk';
import archiver from 'archiver';
import stream from 'stream';

const S3 = new AWS.S3({
  apiVersion: '2006-03-01',
  signatureVersion: 'v4',
  httpOptions: {
    timeout: 300000 // 5 min; should match the Lambda function timeout
  }
});

const UPLOAD_BUCKET_NAME = "my-s3-bucket";
const URL_EXPIRE_TIME = 5*60;

export async function getZipSignedUrl(event) {
  const prefix = `uploads/id123123`;   //replace this with your S3 prefix
  let files = ["12314123.png", "56787567.png"]  //replace this with your files

  if (files.length == 0) {
    console.log("No files to zip");
    return result(404, "No pictures to download");
  }
  console.log("Files to zip: ", files);

  try {
    files = files.map(file => {
        return {
            fileName: file,
            key: prefix + '/' + file,
            type: "file"
        };
    });
    const destinationKey = prefix + '/' + 'uploads.zip'
    console.log("files: ", files);
    console.log("destinationKey: ", destinationKey);

    await streamToZipInS3(files, destinationKey);
    const presignedUrl = await getSignedUrl(UPLOAD_BUCKET_NAME, destinationKey, URL_EXPIRE_TIME, "uploads.zip");
    console.log("presignedUrl: ", presignedUrl);

    if (!presignedUrl) {
      return result(500, null);
    }
    return result(200, presignedUrl);
  }
  catch(error) {
    console.error(`Error: ${error}`);
    return result(500, null);
  }
}

Helper functions:

export function result(code, message) {
  return {
    statusCode: code,
    body: JSON.stringify(
      {
        message: message
      }
    )
  }
}

export async function streamToZipInS3(files, destinationKey) {
  await new Promise((resolve, reject) => {
    var zipStream = streamTo(UPLOAD_BUCKET_NAME, destinationKey, resolve, reject);
    zipStream.on("error", reject);

    var archive = archiver("zip");
    archive.on("error", err => {
      throw new Error(err);
    });
    archive.pipe(zipStream);

    for (const file of files) {
      if (file["type"] == "file") {
        archive.append(getStream(UPLOAD_BUCKET_NAME, file["key"]), {
          name: file["fileName"]
        });
      }
    }
    archive.finalize();
  })
  .catch(err => {
    console.log(err);
    throw new Error(err);
  });
}

function streamTo(bucket, key, resolve, reject) {
  var passthrough = new stream.PassThrough();
  S3.upload(
    {
      Bucket: bucket,
      Key: key,
      Body: passthrough,
      ContentType: "application/zip",
      ServerSideEncryption: "AES256"
    },
    (err, data) => {
      if (err) {
        console.error('Error while uploading zip');
        reject(err);
        return;
      }
      console.log('Zip uploaded');
      resolve();
    }
  ).on("httpUploadProgress", progress => {
    console.log(progress)
  });
  return passthrough;
}

function getStream(bucket, key) {
  let streamCreated = false;
  const passThroughStream = new stream.PassThrough();

  passThroughStream.on("newListener", event => {
    if (!streamCreated && event == "data") {
      const s3Stream = S3
        .getObject({ Bucket: bucket, Key: key })
        .createReadStream();
      s3Stream
        .on("error", err => passThroughStream.emit("error", err))
        .pipe(passThroughStream);

      streamCreated = true;
    }
  });

  return passThroughStream;
}

export async function getSignedUrl(bucket: string, key: string, expires: number, downloadFilename?: string): Promise<string> {
    const exists = await objectExists(bucket, key); // objectExists: a HEAD-object helper, not shown here
    if (!exists) {
        console.info(`Object ${bucket}/${key} does not exists`);
        return null
    }

    let params = {
        Bucket: bucket,
        Key: key,
        Expires: expires,
    };
    if (downloadFilename) {
        params['ResponseContentDisposition'] = `inline; filename="${encodeURIComponent(downloadFilename)}"`; 
    }
    
    try {
        const url = S3.getSignedUrl('getObject', params);
        return url;
    } catch (err) {
        console.error(`Unable to get URL for ${bucket}/${key}`, err);
        return null;
    }
};
NJones
    Used this snippet as a source of inspiration for my own needs. Very useful and working great, thanks a lot ! – PoC Sep 21 '21 at 15:50
  • Cheers! Glad to hear it works! This is why I put this up here! – NJones Sep 22 '21 at 16:04
1

Using streams may be tricky, as I'm not sure how you could pipe multiple streams into an object. I've done this several times using standard file objects. It's a multi-step process and it's quite fast. Remember that Lambda runs on Linux, so you have all the usual Linux resources at hand, including the system /tmp directory (a rough sketch of these steps is shown after the list):

  1. Create a sub-directory in /tmp called "transient" or whatever works for you
  2. Use s3.getObject() and write the file objects to /tmp/transient
  3. Use the GLOB package to generate an array[] of paths from /tmp/transient
  4. Loop the array and zip.addLocalFile(array[i]);
  5. zip.writeZip('/tmp/files.zip');
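A rough sketch of those steps (an illustration under my own assumptions, not code from the answer), using the glob package and adm-zip, whose addLocalFile/writeZip methods match the calls above; the bucket, keys and key names are placeholders, and /tmp space still bounds the final zip size:

// Hedged sketch: buffer objects to /tmp, then zip them from disk with adm-zip
const fs = require('fs');
const path = require('path');
const glob = require('glob');
const AdmZip = require('adm-zip');
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function zipViaTmp(bucket, keys) {
  const workDir = '/tmp/transient';                        // 1. transient sub-directory
  fs.mkdirSync(workDir, { recursive: true });

  for (const key of keys) {                                // 2. write each object to /tmp/transient
    const obj = await s3.getObject({ Bucket: bucket, Key: key }).promise();
    fs.writeFileSync(path.join(workDir, path.basename(key)), obj.Body);
  }

  const paths = glob.sync(`${workDir}/*`);                 // 3. collect the local paths
  const zip = new AdmZip();
  paths.forEach(p => zip.addLocalFile(p));                 // 4. add each local file to the archive
  zip.writeZip('/tmp/files.zip');                          // 5. write the zip to /tmp

  // Optionally push the result back to S3
  await s3.upload({
    Bucket: bucket,
    Key: 'zipped/files.zip',
    Body: fs.createReadStream('/tmp/files.zip')
  }).promise();
}
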
jp_inc
    The only issue I can see with this is that lambda is limited to 500mb storage in the tmp directory. In this case it would also limit the final zip size. – Rabona Aug 30 '16 at 10:29
  • Not sure if you're running any file processing alongside the .zip process, but with that amount of data, you may want to make sure your function can complete within the 5-minute execution time frame. My largest data size is typically around 20-25MB per execution. – jp_inc Aug 31 '16 at 18:57
  • @Rabona did you manage to solve this issue via Lambda? I'm having the same issue. We need to zip a 1.5GB video file with about 100MB of images. We run out of memory. We have also tried with a smaller video file (~1GB) with the same images and get timeouts. Hoping you may have uncovered something useful that could help us out too. – Forer Oct 28 '16 at 10:57
    We eventually solved this issue using a Java streaming solution. This allowed us to bypass the memory issues. – Rabona Nov 01 '16 at 12:02
  • Could you please share how you solved it using the Java streaming solution? – Oleg Jul 04 '17 at 14:15
  • Did you use that library to generate a zip or just transport files? – jp_inc Jul 26 '17 at 23:54
  • We used the library to generate a zip in an S3 bucket. – Rabona Aug 01 '17 at 08:30
  • Hello, please consider reading this [issue](https://github.com/aws/aws-sdk-js/issues/2087) to fully understand what is going on when dealing with streams in a Node.js environment. – Dimas Crocco Mar 17 '20 at 19:52
0

I've used a similar approach, but I'm facing the issue that some of the files in the generated ZIP file don't have the correct size (and corresponding data). Is there any limitation on the size of the files this code can manage? In my case I'm zipping large files (a few larger than 1GB) and the overall amount of data may reach 10GB.

I do not get any error/warning message, so it seems it all works fine.

Any idea what may be happening?

Didac Busquets
0

You can use adm-zip, which lets you work with zip files directly on disk or in memory buffers. It's also simpler to use than the node-archiver library, which also has an unaddressed issue.

TypeScript Code:

import AdmZip from "adm-zip";

import { GetObjectCommand, GetObjectCommandOutput, PutObjectCommand, PutObjectCommandInput, S3Client } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

export async function uploadZipFile(fileKeysToDownload: string[], bucket: string, uploadFileKey: string): Promise<void> {

  // create a new zip file using "adm-zip"
  let zipFile = new AdmZip();

  // Download the existing files from S3 using the GET API
  // (use parallel fetches in your own code; a plain for loop is shown here for simplicity)
  for (let i = 0; i < fileKeysToDownload.length; i++) {
    const data = await getObject(fileKeysToDownload[i], bucket);
    const byteArray = await data!.transformToByteArray();

    // add the downloaded bytes to the newly created zip file
    zipFile.addFile(fileKeysToDownload[i], Buffer.from(byteArray));
  }

  // Convert the zip file to a byte array
  const outputBody = zipFile.toBuffer();

  // upload the zip file to S3 using the PUT API
  await putObject(outputBody, uploadFileKey, bucket);
};

async function getObject(key: string, bucket: string){
  const command: GetObjectCommand = new GetObjectCommand({Bucket: bucket, Key: key});
  const response: GetObjectCommandOutput = await s3.send(command);
  return response.Body;
}

async function putObject(content: Buffer, key: string, bucket: string){
  const input: PutObjectCommandInput = {
    Body: content,
    Bucket: bucket,
    Key: key,
    ContentType: "application/zip"
  }
  const response = await s3.send(
    new PutObjectCommand(input)
  );
}

Is this possible using Lambda, or should I look at a different approach? -> Yes, it is possible.

Is it possible to write to a compressed zip file on the fly, therefore eliminating the memory issue somewhat, or do I need to have the files collected before compression? -> Yes, please use the above approach using adm-zip.
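For example (the handler name, keys and bucket below are placeholders, not from the original answer), the uploadZipFile helper above could be invoked from a Lambda handler like this:

// Hypothetical usage of the uploadZipFile helper above; all names are placeholders
export const handler = async (): Promise<void> => {
  await uploadZipFile(
    ["videos/intro.mp4", "images/cover.jpg"],  // keys of the S3 objects to zip
    "my-bucket",                               // source and destination bucket
    "zipped/images.zip"                        // key for the resulting zip file
  );
};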

himan085