I have a lot of files on S3 that I need to zip and then provide as a zip via S3. Currently I zip them from a stream to a local file and then upload that file again. This takes up a lot of disk space: each file is around 3-10 MB and I have to zip up to 100,000 files, so a single zip can be more than 1 TB. I would like a solution along the lines of this one:

Create a zip file on S3 from files on S3 using Lambda Node

Here it seems the zip is created directly on S3 without taking up local disk space. But I am just not smart enough to transfer that solution to Java. I am also finding conflicting information on the Java AWS SDK, saying that they planned to change the stream behavior in 2017.

Not sure if this will help, but here's what I've been doing so far (Upload is my local model that holds the S3 information). I just removed logging and such for better readability. I think I am not taking up disk space for the download, since I am "piping" the InputStream directly into the zip. But like I said, I would also like to avoid the local zip file and create it directly on S3. That, however, would probably require the ZipOutputStream to be created with S3 as the target instead of a FileOutputStream. Not sure how that can be done.

public File zipUploadsToNewTemp(List<Upload> uploads) {
    List<String> names = new ArrayList<>();

    byte[] buffer = new byte[1024];
    File tempZipFile;
    try {
      tempZipFile = File.createTempFile(UUID.randomUUID().toString(), ".zip");
    } catch (Exception e) {
      throw new ApiException(e, BaseErrorCode.FILE_ERROR, "Could not create Zip file");
    }
    try (
        FileOutputStream fileOutputStream = new FileOutputStream(tempZipFile);
        ZipOutputStream zipOutputStream = new ZipOutputStream(fileOutputStream)) {

      for (Upload upload : uploads) {
        // close each S3 stream even if writing the entry fails
        try (InputStream inputStream = getStreamFromS3(upload)) {
          zipOutputStream.putNextEntry(new ZipEntry(upload.getFileName()));
          writeStreamToZip(buffer, zipOutputStream, inputStream);
          zipOutputStream.closeEntry();
        }
      }
      return tempZipFile;
    } catch (IOException e) {
      logError(type, e);
      if (tempZipFile.exists()) {
        FileUtils.delete(tempZipFile);
      }
      throw new ApiException(e, BaseErrorCode.IO_ERROR,
          "Error zipping files: " + e.getMessage());
    }
}

  // I am not even sure, but I think this takes up memory and not disk space
private InputStream getStreamFromS3(Upload upload) {
    try {
      String filename = upload.getId() + "." + upload.getFileType();
      InputStream inputStream = s3FileService
          .getObject(upload.getBucketName(), filename, upload.getPath());
      return inputStream;
    } catch (ApiException e) {
      throw e;
    } catch (Exception e) {
      logError(type, e);
      throw new ApiException(e, BaseErrorCode.UNKOWN_ERROR,
          "Unknown error communicating with S3 for file: " + upload.getFileName());
    }
}


private void writeStreamToZip(byte[] buffer, ZipOutputStream zipOutputStream,
      InputStream inputStream) {
    try {
      int len;
      while ((len = inputStream.read(buffer)) > 0) {
        zipOutputStream.write(buffer, 0, len);
      }
    } catch (IOException e) {
      throw new ApiException(e, BaseErrorCode.IO_ERROR, "Could not write stream to zip");
    }
}

And finally the upload source code. The InputStream is created from the temp zip file.

public PutObjectResult upload(InputStream inputStream, String bucketName, String filename, String folder) {
    String uploadKey = StringUtils.isEmpty(folder) ? "" : (folder + "/");
    uploadKey += filename;

    ObjectMetadata metaData = new ObjectMetadata();

    byte[] bytes;
    try {
      bytes = IOUtils.toByteArray(inputStream);
    } catch (IOException e) {
      throw new ApiException(e, BaseErrorCode.IO_ERROR, e.getMessage());
    }
    metaData.setContentLength(bytes.length);
    ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(bytes);

    PutObjectRequest putObjectRequest = new PutObjectRequest(bucketPrefix + bucketName, uploadKey, byteArrayInputStream, metaData);
    putObjectRequest.setCannedAcl(CannedAccessControlList.PublicRead);

    try {
      return getS3Client().putObject(putObjectRequest);
    } catch (SdkClientException se) {
      throw s3Exception(se);
    } finally {
      IOUtils.closeQuietly(inputStream);
    }
  }

I just found a similar question to what I need, also without an answer:

Upload ZipOutputStream to S3 without saving zip file (large) temporary to disk using AWS S3 Java

  • Why does it take disk space? Why are you saving the downloaded bytes to disk in the first place? If you don't, it won't take disk space. How about posting what you tried, so that we could explain how to do it better? – JB Nizet Jul 02 '19 at 07:07
  • I didn't want to overcomplicate the question. It's quite a lot of source code and it probably cannot be improved the way I want it to. I feel it'll be better to start from scratch – Pete Jul 02 '19 at 07:09
  • Then start from scratch. And just don't write the downloaded objects to a file, so that it doesn't take any disk space. – JB Nizet Jul 02 '19 at 07:10
  • I would suggest using an Amazon EC2 instance (as low as 1c/hour, or you could even use a Spot Instance to get it at a lower price). Write a script to loop through the files, then download, zip, upload. If the EC2 instance is in the same region as Amazon S3 then there is no Data Transfer charge. – John Rotenstein Jul 02 '19 at 07:22
  • I've added my current source code for zipping and upload – Pete Jul 02 '19 at 07:27
  • Your code doesn't write anything to disk except for the zip file. But even then, that's useless: you write bytes to a zip file, then read the whole zip file as an input stream, then transform the input stream to a byte array, and then finally upload the byte array. So why don't you just write to a ByteArrayOutputStream in memory, then get the byte array created by the byte array output stream and upload that? Or is the actual problem something completely other than "it takes too much disk space"? – JB Nizet Jul 02 '19 at 07:46
  • I just noticed that the above has another problem: Reading it all to memory. That won't work for zip files that can be 100s of GB large. I really need a Stream directly to S3 with which I can initialize my Zipper – Pete Jul 02 '19 at 07:50
  • I agree with @JohnRotenstein that you should keep this inside your AWS account. You avoid the egress charges, and data transfer performance will be significantly higher. I would investigate AWS Batch for your task too, which may be well suited to your requirements. – Simon B Jul 02 '19 at 12:18

2 Answers

You can get an input stream from your S3 data, then zip those bytes and stream the result back to S3:

        long numBytes;  // length of the data to send, in bytes... somehow you need to know it before processing the entire stream
        PipedOutputStream os = new PipedOutputStream();
        PipedInputStream is = new PipedInputStream(os);
        ObjectMetadata meta = new ObjectMetadata();
        meta.setContentLength(numBytes);

        new Thread(() -> {
            /* Write to os here; make sure to close it when you're done */
            try (ZipOutputStream zipOutputStream = new ZipOutputStream(os)) {
                ZipEntry zipEntry = new ZipEntry("myKey");
                zipOutputStream.putNextEntry(zipEntry);
                
                S3ObjectInputStream objectContent = amazonS3Client.getObject("myBucket", "myKey").getObjectContent();
                byte[] bytes = new byte[1024];
                int length;
                while ((length = objectContent.read(bytes)) >= 0) {
                    zipOutputStream.write(bytes, 0, length);
                }
                objectContent.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }).start();
        amazonS3Client.putObject("myBucket", "myZip.zip", is, meta); // upload under a key different from the source object
        is.close();  // always close your streams
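
One catch with the piped-stream approach: it assumes numBytes, the size of the finished zip, is known before the upload starts, which usually isn't the case when zipping on the fly. A way around that (my assumption, not something the answer above shows) is to point the ZipOutputStream at an OutputStream that feeds S3's multipart upload API, which only needs the size of each individual part (at least 5 MB, except for the last one). A minimal sketch; the class name is made up and error handling such as aborting the multipart upload on failure is left out:

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.model.*;

    import java.io.ByteArrayInputStream;
    import java.io.OutputStream;
    import java.util.ArrayList;
    import java.util.List;

    // Streams everything written to it into an S3 multipart upload, so the total
    // size never has to be known up front and nothing touches the local disk.
    public class S3MultipartOutputStream extends OutputStream {

      private static final int PART_SIZE = 5 * 1024 * 1024; // S3 minimum part size

      private final AmazonS3 s3;
      private final String bucket;
      private final String key;
      private final String uploadId;
      private final List<PartETag> partETags = new ArrayList<>();
      private final byte[] buffer = new byte[PART_SIZE];
      private int count = 0;
      private int partNumber = 1;

      public S3MultipartOutputStream(AmazonS3 s3, String bucket, String key) {
        this.s3 = s3;
        this.bucket = bucket;
        this.key = key;
        this.uploadId = s3.initiateMultipartUpload(
            new InitiateMultipartUploadRequest(bucket, key)).getUploadId();
      }

      @Override
      public void write(int b) {
        buffer[count++] = (byte) b;
        if (count == PART_SIZE) {
          uploadPart(false);
        }
      }

      @Override
      public void write(byte[] b, int off, int len) {
        while (len > 0) {
          int chunk = Math.min(len, PART_SIZE - count);
          System.arraycopy(b, off, buffer, count, chunk);
          count += chunk;
          off += chunk;
          len -= chunk;
          if (count == PART_SIZE) {
            uploadPart(false);
          }
        }
      }

      // Uploads the buffered bytes as the next part and remembers its ETag.
      private void uploadPart(boolean lastPart) {
        UploadPartResult result = s3.uploadPart(new UploadPartRequest()
            .withBucketName(bucket)
            .withKey(key)
            .withUploadId(uploadId)
            .withPartNumber(partNumber++)
            .withInputStream(new ByteArrayInputStream(buffer, 0, count))
            .withPartSize(count)
            .withLastPart(lastPart));
        partETags.add(result.getPartETag());
        count = 0;
      }

      @Override
      public void close() {
        if (count > 0 || partETags.isEmpty()) {
          uploadPart(true); // flush the final (possibly short) part
        }
        s3.completeMultipartUpload(
            new CompleteMultipartUploadRequest(bucket, key, uploadId, partETags));
      }
    }

With something along those lines, the zipping loop from the question could write straight to S3 via new ZipOutputStream(new S3MultipartOutputStream(s3Client, bucket, key)), with neither a temp file nor a full in-memory copy of the archive.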

I would suggest using an Amazon EC2 instance (as low as 1c/hour, or you could even use a Spot Instance to get it at a lower price). Smaller instance types are lower cost but have limited bandwidth, so play around with the size to get your preferred performance.

Write a script to loop through the files, then:

  • Download
  • Zip
  • Upload
  • Delete local files

All the zip magic happens on local disk. No need to use streams. Just use the Amazon S3 download_file() and upload_file() calls.

If the EC2 instance is in the same region as Amazon S3 then there is no Data Transfer charge.
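
A rough sketch of that loop in Java, for anyone not using Python (download_file() and upload_file() above are the boto3 calls; the bucket names, keys, and scratch path below are placeholders):

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.GetObjectRequest;

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Arrays;
    import java.util.List;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipOutputStream;

    public class ZipOnEc2 {

      public static void main(String[] args) throws IOException {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        List<String> keys = Arrays.asList("file1.pdf", "file2.pdf"); // the keys to bundle
        File zipFile = new File("/mnt/scratch/archive.zip");         // needs enough local disk

        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(zipFile))) {
          byte[] buffer = new byte[8192];
          for (String key : keys) {
            File local = File.createTempFile("part-", ".tmp");
            s3.getObject(new GetObjectRequest("source-bucket", key), local); // download
            zos.putNextEntry(new ZipEntry(key));                             // zip
            try (InputStream in = new FileInputStream(local)) {
              int len;
              while ((len = in.read(buffer)) > 0) {
                zos.write(buffer, 0, len);
              }
            }
            zos.closeEntry();
            local.delete();                                                  // delete local file
          }
        }
        s3.putObject("target-bucket", "archive.zip", zipFile);              // upload
        zipFile.delete();
      }
    }

The 1 TB case from the question would of course need an instance (or an attached EBS volume) with enough disk for the finished zip.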
