
I am trying to read a file in .gz format from a third-party AWS S3 bucket. I need to process the data in the file and upload the file to our own S3 bucket.

To read the file, I am creating a read stream from S3.getObject as shown below:

const fileStream = externalS3.getObject({Bucket: <bucket-name>, Key: <key>}).createReadStream();

To make the code more efficient, I plan to use the same fileStream both for processing the contents and for uploading to our own S3 bucket. However, the code below does not upload the file to the internal S3 bucket.

import stream from "stream";
import { createGunzip } from "zlib";
import JSONStream from "JSONStream";

// Fan the single source stream out into two PassThrough streams.
const uploadStream = fileStream.pipe(new stream.PassThrough());
const readStream = fileStream.pipe(new stream.PassThrough());

// Upload the raw .gz data to our own bucket.
await internalS3.upload({Bucket: <bucket-name>, Key: <key>, Body: uploadStream})
    .on("httpUploadProgress", progress => {console.log(progress)})
    .on("error", error => {console.log(error)})
    .promise();

// Decompress and parse the same data for processing.
readStream.pipe(createGunzip())
    .on("error", err => {console.log(err)})
    .pipe(JSONStream.parse())
    .on("data", data => {console.log(data)});

However, the following code successfully uploads the file to the internal S3 bucket:

const uploadStream = fileStream.pipe(new stream.PassThrough());

await internalS3.upload({Bucket: <bucket-name>, Key: <key>, Body: uploadStream})
    .on("httpUploadProgress", progress => {console.log(progress)})
    .on("error", error => {console.log(error)})
    .promise();

What am I doing wrong here?

NOTE: If I use separate fileStreams to upload and read the data, it works fine (roughly as sketched below). However, I need to achieve this using the same fileStream.
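For reference, the working two-stream variant looks roughly like this (a sketch; variable names are illustrative, and it simply calls getObject twice so each consumer gets its own stream):

const uploadSource = externalS3.getObject({Bucket: <bucket-name>, Key: <key>}).createReadStream();
const readSource = externalS3.getObject({Bucket: <bucket-name>, Key: <key>}).createReadStream();

await internalS3.upload({Bucket: <bucket-name>, Key: <key>, Body: uploadSource}).promise();

readSource.pipe(createGunzip())
    .on("error", err => {console.log(err)})
    .pipe(JSONStream.parse())
    .on("data", data => {console.log(data)});

This downloads the object twice, which is exactly what I am trying to avoid.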

  • Your script works fine for me. Could you please show the error when the object is not getting uploaded to S3? Also, please mention the Node version you're using. – Rahul Sharma Sep 01 '22 at 11:23
  • @Rahul Sharma - First of all, thanks for responding. I do not get an error; the upload to S3 just seems to be stuck. The Node version I am using is 16.6.2. – Rachit Anand Sep 01 '22 at 14:52
  • What is the file size you're trying to upload? Could you also tell the time it takes to process the stream: **(1)** uploading the file to S3, **(2)** reading, creating the gunzip, and logging it to the console? – Rahul Sharma Sep 01 '22 at 16:14
  • The file size varies, but is mostly upwards of 1 GB. The upload, if I use individual streams, usually takes about 2 minutes. gunzip is a lot quicker. I am performing more async functions on the unzipped data that I have not mentioned here; with those, it will take longer. – Rachit Anand Sep 01 '22 at 16:23
  • I think this is happening because the stream has already ended while the SDK was trying to upload the file, since the `readStream` is a lot quicker and requests data at a higher rate than the upload, which has to put the data on a remote location over the network. See related issue: https://github.com/aws/aws-sdk-js/issues/3004. You can fix this with the solution mentioned here: https://stackoverflow.com/a/33879208/1973735. – Rahul Sharma Sep 01 '22 at 16:29

1 Answer


The files you are trying to upload to S3 are relatively large (~1 GB), as mentioned by the OP. Two streams are created here by piping the single fileStream:

const uploadStream = fileStream.pipe(new stream.PassThrough());
const readStream = fileStream.pipe(new stream.PassThrough());

While the operations on readStream are less time consuming, uploadStream is responsible for uploading the file to a remote location (in this case S3) over the network, which takes relatively more time. This also means that readStream pulls/requests data from the fileStream at a higher rate than uploadStream does. By the time readStream has finished, the fileStream is already consumed, and the .upload call to the AWS SDK hangs. See this related issue: https://github.com/aws/aws-sdk-js/issues/3004.

You can fix it by using a library to synchronise the two streams, so that the faster consumer does not outrun the slower one. An example of how to achieve that can be found here: https://stackoverflow.com/a/33879208/1973735.
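As a rough illustration of that idea (this is only a sketch, not the approach from the linked answer; the bucket/key placeholders are as in the question), one way to keep the two consumers in step is to start both of them before awaiting either, and only then wait for both to finish:

const uploadStream = fileStream.pipe(new stream.PassThrough());
const readStream = fileStream.pipe(new stream.PassThrough());

// Start the upload, but do not await it yet.
const uploadPromise = internalS3
    .upload({Bucket: <bucket-name>, Key: <key>, Body: uploadStream})
    .on("httpUploadProgress", progress => {console.log(progress)})
    .promise();

// Start the processing side at the same time, wrapped in a promise so it
// can be awaited together with the upload.
const processPromise = new Promise((resolve, reject) => {
    readStream
        .pipe(createGunzip())
        .on("error", reject)
        .pipe(JSONStream.parse())
        .on("data", data => {console.log(data)})
        .on("end", resolve)
        .on("error", reject);
});

// Wait for both consumers to finish.
await Promise.all([uploadPromise, processPromise]);

The intent is that the upload is already in flight while the gunzip side drains its copy of the data, so neither consumer is left waiting on a source the other has already exhausted.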

Rahul Sharma