
I am trying to do an AWS multipart upload using the AWS SDK from a Spark job. The file is around 14 GB, but I am getting an out-of-memory error. It fails at this line: val bytes: Array[Byte] = IOUtils.toByteArray(is)

I have tried bumping the driver and executor memory up to 100 GB and tried a few other Spark optimizations.

Below is the code I am trying:

    val tm = TransferManagerBuilder.standard.withS3Client(s3Client).build
    val fs = FileSystem.get(new Configuration())
    val filePath = new Path(hdfsFilePath)
    val is: InputStream = fs.open(filePath)
    val om = new ObjectMetadata()
    val bytes: Array[Byte] = IOUtils.toByteArray(is) // OutOfMemoryError is thrown here
    om.setContentLength(bytes.length)
    val byteArrayInputStream: ByteArrayInputStream = new ByteArrayInputStream(bytes)
    val request = new PutObjectRequest(bucketName, keyName, byteArrayInputStream, om)
      .withSSEAwsKeyManagementParams(new SSEAwsKeyManagementParams(kmsKey))
      .withCannedAcl(CannedAccessControlList.BucketOwnerFullControl)
    val upload = tm.upload(request)

And this is the exception I am getting:

java.lang.OutOfMemoryError
                at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
                at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
                at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
                at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
                at com.amazonaws.util.IOUtils.toByteArray(IOUtils.java:45)
Arpan
  • If the library supports it, it would probably be better to send a stream rather than converting the whole thing to a byte array (see the sketch after these comments). Here is a related question you can look at for more things to consider: https://stackoverflow.com/questions/29105178/uploading-large-file-to-s3-with-ruby-fails-with-out-of-memory-error-how-to-read?rq=1 – JoseM Jun 24 '19 at 17:00
  • Hi, you can have a look at [Benji S3 lib](https://zengularity.github.io/benji/s3/usage.html) (I'm a contributor), which supports Akka Streams for object storage operations – cchantep Jun 25 '19 at 10:12
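
Building on that comment, here is a minimal sketch of the streaming variant (untested; it assumes the same s3Client, bucketName, keyName, and kmsKey values as in the question). The key change is taking the Content-Length from the HDFS file status and handing the open stream straight to TransferManager, so the 14 GB file is never materialized in memory:

    import java.io.InputStream

    import com.amazonaws.services.s3.model.{CannedAccessControlList, ObjectMetadata, PutObjectRequest, SSEAwsKeyManagementParams}
    import com.amazonaws.services.s3.transfer.TransferManagerBuilder
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val tm = TransferManagerBuilder.standard.withS3Client(s3Client).build
    val fs = FileSystem.get(new Configuration())
    val filePath = new Path(hdfsFilePath)

    // Ask HDFS for the size instead of buffering the bytes just to count them.
    val om = new ObjectMetadata()
    om.setContentLength(fs.getFileStatus(filePath).getLen)

    val is: InputStream = fs.open(filePath)
    try {
      val request = new PutObjectRequest(bucketName, keyName, is, om)
        .withSSEAwsKeyManagementParams(new SSEAwsKeyManagementParams(kmsKey))
        .withCannedAcl(CannedAccessControlList.BucketOwnerFullControl)
      // With a known content length, TransferManager reads the stream in
      // chunks and uploads them as a multipart upload.
      tm.upload(request).waitForCompletion()
    } finally {
      is.close()
    }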

1 Answer


A single Java byte array is capped at about 2 GB (Integer.MAX_VALUE entries), so IOUtils.toByteArray can never hold a 14 GB file; the fix is to avoid buffering entirely. PutObjectRequest accepts a File, which TransferManager can read and upload in parts without loading it into memory:

public PutObjectRequest(String bucketName, String key, File file)

Something like the following should work (I haven't checked, though):

    val result = TransferManagerBuilder.standard.withS3Client(s3Client)
      .build
      .upload(
        new PutObjectRequest(
          bucketName,
          keyName,
          new File(hdfsFilePath) // java.io.File takes a String, not a Hadoop Path
        )
        .withSSEAwsKeyManagementParams(new SSEAwsKeyManagementParams(kmsKey))
        .withCannedAcl(CannedAccessControlList.BucketOwnerFullControl)
      )
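
One caveat (not covered in the answer above): java.io.File can only address the local filesystem, so if hdfsFilePath really points into HDFS you would first have to copy the file down to local disk, for example (untested; localPath is a hypothetical scratch location on the driver):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    // localPath is a placeholder scratch path on the driver's local disk.
    val localPath = "/tmp/upload-staging.dat"
    fs.copyToLocalFile(new Path(hdfsFilePath), new Path(localPath))
    // ...then pass new File(localPath) to the PutObjectRequest above.

This trades memory for local disk space, which is usually the safer resource for a 14 GB file.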
Sergey Romanovsky