
I have an input stream that is potentially 20-30 MB. I'm trying to upload it in chunks as a multipart upload to S3.

I have the content length and the input stream available. How can I do this efficiently, with memory in mind?

I saw someone had done something like this, but not sure I fully understand it:

    int contentLength = inputStreamMetadata.getContentLength();
    int partSize = 512 * 1024; // Set part size to 512 KB
    int filePosition = 0;

    ByteArrayInputStream bais = inputStreamMetadata.getInputStream();
    List<PartETag> partETags = new ArrayList<>();
    byte[] chunkedFileBytes = new byte[partSize];
    for (int i = 1; filePosition < contentLength; i++) {
      // Because the last part could be less than 5 MB, adjust the part size as needed.
      partSize = Math.min(partSize, (contentLength - filePosition));

      filePosition += bais.read(chunkedFileBytes, filePosition, partSize);

      // Create the request to upload a part.
      UploadPartRequest uploadRequest = new UploadPartRequest()
          .withBucketName(bucketName)
          .withUploadId(uploadId)
          .withKey(fileName)
          .withPartNumber(i)
          .withInputStream(new ByteArrayInputStream(chunkedFileBytes, 0, partSize))
          .withPartSize(partSize);

      UploadPartResult uploadResult = client.uploadPart(uploadRequest);
      partETags.add(uploadResult.getPartETag());
    }

Specifically this piece: `.withInputStream(new ByteArrayInputStream(bytes, 0, bytesRead))`

Ryan
  • actually it aligns with [AWS low level API upload doc](https://docs.aws.amazon.com/AmazonS3/latest/dev/llJavaUploadFile.html) ... one difference: the sample uses `withFile` and `withFileOffset`, where you use `withInputStream` (seemingly also correct: with an InputStream from the currently loaded chunk) ...one *tiny* problem, I see, is ..the last iteration (`bytesRead <= 0`) ..but would also test how it behaves. – xerx593 Mar 23 '20 at 21:33
  • The API should have a way of enabling chunked transfer mode, which does it all for you. – user207421 Mar 23 '20 at 21:49
  • I updated the code, but I'm getting `Range [524288, 524288 + 179947) out of bounds for length 524288` and I don't know why – Ryan Mar 23 '20 at 22:06
  • ..but is the effort (of chunking) worth it for 20-30 MB!? (I met guys here @[so], who bumped 2 GB via `putObject` (knowing the file size;) ...https://stackoverflow.com/q/54379555/592355 ...see the sketch after these comments) – xerx593 Mar 23 '20 at 22:07
  • Maybe not honestly, the uploads are happening very slow for me with just a few mb, but not sure why. I was thinking the parts were all done in parallel, but don't really see that being the case. – Ryan Mar 23 '20 at 22:07
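
For reference, the single-call `putObject` alternative mentioned in the comments above would look roughly like this; a minimal sketch assuming the AWS SDK for Java v1 and the `client`, `bucketName`, `fileName`, and `inputStreamMetadata` names used elsewhere in this question:

    // Sketch: single putObject call, with the content length set up front
    // so the SDK can stream the body instead of buffering it in memory.
    ObjectMetadata metadata = new ObjectMetadata();
    metadata.setContentLength(inputStreamMetadata.getContentLength());

    PutObjectRequest putRequest = new PutObjectRequest(
            bucketName, fileName, inputStreamMetadata.getInputStream(), metadata);

    client.putObject(putRequest);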

1 Answer


Sorry, I cannot (easily) test it, but I think you are really close ... you just have to "fix" and "arrange" your loop!

Combining https://stackoverflow.com/a/22128215/592355 with your latest code:

    int partSize = 5 * 1024 * 1024; // Set part size to 5 MB
    ByteArrayInputStream bais = inputStreamMetadata.getInputStream();
    List<PartETag> partETags = new ArrayList<>();
    byte[] buff = new byte[partSize];
    int partNumber = 1;
    while (true) { //!
        int readBytes = bais.read(buff); // readBytes in [-1 .. partSize]!
        if (readBytes == -1) { // EOF
            break;
        }
        // Create the request to upload a part.
        UploadPartRequest uploadRequest = new UploadPartRequest()
                .withBucketName(bucketName)
                .withUploadId(uploadId)
                .withKey(fileName)
                .withPartNumber(partNumber++)
                .withInputStream(new ByteArrayInputStream(buff, 0, readBytes))
                .withPartSize(readBytes);

        UploadPartResult uploadResult = client.uploadPart(uploadRequest);
        partETags.add(uploadResult.getPartETag());
    }
    // Complete the multipart upload....
    // https://docs.aws.amazon.com/AmazonS3/latest/dev/llJavaUploadFile.html
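
For completeness, the initiate/complete calls referenced in that last comment would look roughly like this (a minimal sketch assuming the AWS SDK for Java v1 and the same `client`, `bucketName`, and `fileName` as above):

    // Before the loop: start the multipart upload and obtain the upload id.
    InitiateMultipartUploadRequest initRequest =
            new InitiateMultipartUploadRequest(bucketName, fileName);
    String uploadId = client.initiateMultipartUpload(initRequest).getUploadId();

    // ...upload the parts as shown above, collecting partETags...

    // After the loop: complete the multipart upload with the collected part ETags.
    CompleteMultipartUploadRequest completeRequest =
            new CompleteMultipartUploadRequest(bucketName, fileName, uploadId, partETags);
    client.completeMultipartUpload(completeRequest);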
xerx593
  • Thanks! I will try it out. Is there any speed benefit for uploading files with this approach? Like if I did a single putObjectReq or this with 10mb, it would probably take about the same time, right? – Ryan Mar 23 '20 at 22:46
  • -tell us! :) ..but I doubt so...with "upload" not the "count of threads/parts" is the bottleneck, but your/client's "upstream" (and what arrives at aws) ..you could realize a "big (runtime) advantage" if you could "fire & forget" ... please also "test" the ["high level" approach](https://docs.aws.amazon.com/AmazonS3/latest/dev/HLuploadFileJava.html) and especially (try to avoid) the `waitForCompletion();` part. – xerx593 Mar 23 '20 at 22:58
  • I would do high-level, but dealing with this input stream, I'm not sure that's possible.. I would like to do the async upload though – Ryan Mar 23 '20 at 23:19
  • no problem - high level offers also an [`upload(InputStream)`](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/transfer/TransferManager.html#upload-java.lang.String-java.lang.String-java.io.InputStream-com.amazonaws.services.s3.model.ObjectMetadata-), but it also asks for "meta data" (-> file size, which is known to you!?) ...see the sketch after these comments – xerx593 Mar 23 '20 at 23:23
  • So in my above case, what if the last chunk I read isn't quite 5 MB, let's say it's 500 KB, then what would happen? readBytes would be -1? – Ryan Mar 23 '20 at 23:27
  • ..nah, `readBytes` would be `500*1024` in the "second last" iteration...and `-1` in the last one. – xerx593 Mar 23 '20 at 23:29
  • that's one "advantage" of my answer/low level - it doesn't care about/need file size, it is also "fire & forget" (so far) ...just choose a bigger chunk size (that most of the (possible) files fit in 1;) – xerx593 Mar 23 '20 at 23:31
  • ok gotcha. Does the high-level API do all of this logic for me? So I would literally just pass it my parent input stream and it would chunk it for me elsewhere? `ByteArrayInputStream bais = inputStreamMetadata.getInputStream();` – Ryan Mar 23 '20 at 23:32
  • that's what "high level" would mean for me:) .. and user207421 (275k rep.!) probably tried to tell. – xerx593 Mar 23 '20 at 23:34
  • Dang.. I was maybe way-overcomplicating this then with all this custom byte logic.. – Ryan Mar 24 '20 at 00:16
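
A minimal sketch of the "high level" `TransferManager` approach discussed in these comments, assuming the AWS SDK for Java v1 and the `client`, `bucketName`, `fileName`, and `inputStreamMetadata` names from the question; `TransferManager` handles the part splitting and upload for you:

    // Sketch: high-level upload via TransferManager (AWS SDK for Java v1).
    TransferManager transferManager = TransferManagerBuilder.standard()
            .withS3Client(client)
            .build();

    // Supplying the content length lets the SDK stream the body
    // instead of buffering the whole input in memory.
    ObjectMetadata metadata = new ObjectMetadata();
    metadata.setContentLength(inputStreamMetadata.getContentLength());

    Upload upload = transferManager.upload(
            bucketName, fileName, inputStreamMetadata.getInputStream(), metadata);

    // upload.waitForCompletion(); // blocks (and throws InterruptedException); omit for "fire & forget"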