I'm using the code from this question to archive files with node-archiver and transfer them to S3. My specific task requires me to download a large number of files from URLs, zip them into one archive, and upload that archive to S3.
I'm using the got JavaScript library to download the files:
for (const file of files) {
    const { resultFileName, fileUrl } = getFileNameAndUrl(file);
    if (!fileUrl)
        continue;

    // Stream each remote file straight into the zip archive.
    const downloadStream = got.stream(fileUrl, {
        retry: {
            limit: 5
        }
    });
    archive.append(downloadStream, { name: resultFileName });
}
The rest of the code is pretty much the same as in the original question. The issue is that the script doesn't work well with a huge number of files: at some point it simply finishes execution without having processed them all.
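For reference, the upload side looks roughly like this — just a minimal sketch of what I believe the linked question does, assuming the aws-sdk v2 S3.upload API (which accepts a readable stream as Body); the bucket and key names are placeholders:

const { PassThrough } = require('stream');
const archiver = require('archiver');
const AWS = require('aws-sdk');

const s3 = new AWS.S3();

// The zip output is piped into a PassThrough stream,
// which S3.upload consumes as the object body.
const archive = archiver('zip');
const uploadStream = new PassThrough();
archive.pipe(uploadStream);

const uploadPromise = s3
    .upload({
        Bucket: 'my-bucket',   // placeholder
        Key: 'files.zip',      // placeholder
        Body: uploadStream,
    })
    .promise();

// ... the for-loop above appends the downloaded files ...

archive.finalize();
uploadPromise.then(() => console.log('upload complete'));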
In a perfect world, I want this script to download the files, append them to the archive, and transfer the archive to S3 using pipes, and ideally to download them in batches with limited concurrency (something like Promise.map with the concurrency option in Bluebird). I just don't understand how to do this with streams, since I don't have much experience with them.
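In case it helps frame an answer, this is the rough shape I have in mind — only a sketch under my own assumptions (getFileNameAndUrl is my helper from above, and I'm guessing that waiting on each got stream with finished from stream/promises is a reasonable way to keep only one batch of downloads in flight at a time):

const { finished } = require('stream/promises');

// Sketch: append downloads to the archive in batches of `concurrency`,
// waiting until each batch's streams have been fully consumed before
// starting the next batch (archiver drains appended entries one at a time).
async function appendInBatches(archive, files, concurrency = 5) {
    for (let i = 0; i < files.length; i += concurrency) {
        const batch = files.slice(i, i + concurrency);
        await Promise.all(batch.map(async (file) => {
            const { resultFileName, fileUrl } = getFileNameAndUrl(file);
            if (!fileUrl) return;

            const downloadStream = got.stream(fileUrl, { retry: { limit: 5 } });
            archive.append(downloadStream, { name: resultFileName });

            // Resolves once this download has been read to the end
            // (or rejects if the request fails).
            await finished(downloadStream);
        }));
    }
}

I'm not sure whether waiting on finished(downloadStream) is the right back-pressure signal here, or how an error on a single download should be handled without breaking the whole archive — that is exactly the part I don't understand.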