
We are developing a document microservice that needs to use Azure as storage for file content. Azure Block Blob seemed like a reasonable choice. The document service has its heap limited to 512 MB (-Xmx512m).

I was not successful in getting a streaming file upload to work within the limited heap using azure-storage-blob:12.10.0-beta.1 (also tested on 12.9.0).

The following approaches were attempted:

  1. Copy-pasting from the documentation using BlockBlobClient:
BlockBlobClient blockBlobClient = blobContainerClient.getBlobClient("file").getBlockBlobClient();

File file = new File("file");

try (InputStream dataStream = new FileInputStream(file)) {
  blockBlobClient.upload(dataStream, file.length(), true /* overwrite file */);
}

Result: java.io.IOException: mark/reset not supported. The SDK tries to use mark/reset even though FileInputStream reports that it does not support this feature.
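
For reference, the mismatch is easy to see in isolation: FileInputStream inherits the default markSupported() implementation, which returns false, while BufferedInputStream overrides it to return true. A minimal check (the file name is illustrative):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkSupportCheck {
  public static void main(String[] args) throws IOException {
    try (InputStream plain = new FileInputStream("file");
         InputStream buffered = new BufferedInputStream(new FileInputStream("file"))) {
      // FileInputStream does not support mark/reset, so any reset() call on it fails
      System.out.println("FileInputStream markSupported: " + plain.markSupported());        // false
      System.out.println("BufferedInputStream markSupported: " + buffered.markSupported()); // true
    }
  }
}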

  2. Adding a BufferedInputStream to mitigate the mark/reset issue (per advice):
BlockBlobClient blockBlobClient = blobContainerClient.getBlobClient("file").getBlockBlobClient();

File file = new File("file");

try (InputStream dataStream = new BufferedInputStream(new FileInputStream(file))) {
  blockBlobClient.upload(dataStream, file.length(), true /* overwrite file */);
}

Result: java.lang.OutOfMemoryError: Java heap space. I assume that the SDK attempted to load all 1.17 GB of file content into memory.
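
This would be consistent with how BufferedInputStream behaves if something calls mark() with a very large read limit: the stream keeps growing its internal buffer so that reset() can rewind, effectively retaining everything it has read. The snippet below is only a standalone illustration of that JDK behaviour, not necessarily the SDK's actual code path; with a 1.17 GB file and -Xmx512m it fails the same way:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkBufferingDemo {
  public static void main(String[] args) throws IOException {
    try (InputStream is = new BufferedInputStream(new FileInputStream("file"))) {
      // A mark with a huge read limit forces BufferedInputStream to keep every byte
      // read afterwards in its internal byte[] so that reset() could rewind.
      is.mark(Integer.MAX_VALUE);
      byte[] chunk = new byte[8192];
      long total = 0;
      int n;
      while ((n = is.read(chunk)) != -1) {
        total += n; // ends in java.lang.OutOfMemoryError long before the file is fully read
      }
      System.out.println("read " + total + " bytes");
    }
  }
}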

  3. Replacing BlockBlobClient with BlobClient and removing the heap size limitation (-Xmx512m):
BlobClient blobClient = blobContainerClient.getBlobClient("file");

File file = new File("file");

try (InputStream dataStream = new FileInputStream(file)) {
  blobClient.upload(dataStream, file.length(), true /* overwrite file */);
}

Result: 1.5 GB of heap memory used; all file content is loaded into memory, plus some additional buffering on the Reactor side.

Heap usage from VisualVM

  4. Switching to streaming via BlobOutputStream:
long blockSize = DataSize.ofMegabytes(4L).toBytes();

BlockBlobClient blockBlobClient = blobContainerClient.getBlobClient("file").getBlockBlobClient();

// create / erase blob
blockBlobClient.commitBlockList(List.of(), true);

BlockBlobOutputStreamOptions options = (new BlockBlobOutputStreamOptions()).setParallelTransferOptions(
  (new ParallelTransferOptions()).setBlockSizeLong(blockSize).setMaxConcurrency(1).setMaxSingleUploadSizeLong(blockSize));

try (InputStream is = new FileInputStream("file")) {
  try (OutputStream os = blockBlobClient.getBlobOutputStream(options)) {
    IOUtils.copy(is, os); // uses 8KB buffer
  }
}

Result: the file is corrupted during upload. The Azure web portal shows 1.09 GB instead of the expected 1.17 GB. Manually downloading the file from the Azure web portal confirms that the file content was corrupted during upload. The memory footprint decreased significantly, but the file corruption is a showstopper.
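
Beyond eyeballing sizes in the portal, a byte-level comparison against a locally downloaded copy makes the corruption unambiguous. A minimal sketch that streams both files through SHA-256, so it also fits in the 512 MB heap (the paths are illustrative):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class BlobIntegrityCheck {
  public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
    Path original = Paths.get("file");               // local source file
    Path downloaded = Paths.get("file-from-azure");  // copy downloaded from the Azure web portal

    System.out.println("size matches:    " + (Files.size(original) == Files.size(downloaded)));
    System.out.println("content matches: " + MessageDigest.isEqual(sha256(original), sha256(downloaded)));
  }

  private static byte[] sha256(Path path) throws IOException, NoSuchAlgorithmException {
    MessageDigest digest = MessageDigest.getInstance("SHA-256");
    try (InputStream is = Files.newInputStream(path)) {
      byte[] buffer = new byte[8192];
      int n;
      while ((n = is.read(buffer)) != -1) {
        digest.update(buffer, 0, n); // stream the file instead of loading it into memory
      }
    }
    return digest.digest();
  }
}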

Problem: I cannot come up with a working upload / download solution with a small memory footprint.

Any help would be greatly appreciated!

  • "file is corrupted during upload. Azure web portal shows 1.09GB instead of expected 1.17GB", could it be that Azure web portal shows [Gibibyte / GiB](https://en.wikipedia.org/wiki/Byte#Multiple-byte_units) (i.e. 1024³ bytes) instead of Gigabyte (i.e. 1000³ bytes)? Because 1.17 GB ≈ 1.09 GiB. (Though if you confirmed locally that the uploaded file is corrupt, then that might not be the answer) – Marcono1234 Dec 21 '20 at 17:23
  • @Marcono1234 yes, I verified that the byte sizes of the original and uploaded files are the same, so by all means you are right. However, the file itself is corrupted (e.g. an uploaded image has 60% grey pixels, an uploaded video is not playable). I used this fragment to verify that the byte size is the same (as well as checking it manually via download & compare): `log.info("EXPECTED SIZE: {}; ACTUAL SIZE: {}", image.length(), blockBlobClient.getProperties().getBlobSize());` I created this issue on GitHub as a sanity check: https://github.com/Azure/azure-sdk-for-java/issues/18295 – white-sagittarius Dec 21 '20 at 19:04

1 Answer


Please try the code below to upload / download big files. I have tested it on my side using a .zip file of about 1.1 GB.

For uploading files:

public static void uploadFilesByChunk() {
    String connString = "<conn str>";
    String containerName = "<container name>";
    String blobName = "UploadOne.zip";
    String filePath = "D:/temp/" + blobName;

    BlobServiceClient client = new BlobServiceClientBuilder().connectionString(connString).buildClient();
    BlobClient blobClient = client.getBlobContainerClient(containerName).getBlobClient(blobName);
    long blockSize = 2 * 1024 * 1024; // 2 MB
    ParallelTransferOptions parallelTransferOptions = new ParallelTransferOptions()
            .setBlockSizeLong(blockSize).setMaxConcurrency(2)
            .setProgressReceiver(new ProgressReceiver() {
                @Override
                public void reportProgress(long bytesTransferred) {
                    System.out.println("uploaded:" + bytesTransferred);
                }
            });

    BlobHttpHeaders headers = new BlobHttpHeaders().setContentLanguage("en-US").setContentType("binary");

    blobClient.uploadFromFile(filePath, parallelTransferOptions, headers, null, AccessTier.HOT,
            new BlobRequestConditions(), Duration.ofMinutes(30));
}

Memory footprint: (screenshot)

For downloading files:

public static void downLoadFilesByChunk() {
    String connString = "<conn str>";
    String containerName = "<container name>";
    String blobName = "UploadOne.zip";

    String filePath = "D:/temp/" + "DownloadOne.zip";

    BlobServiceClient client = new BlobServiceClientBuilder().connectionString(connString).buildClient();
    BlobClient blobClient = client.getBlobContainerClient(containerName).getBlobClient(blobName);
    long blockSize = 2 * 1024 * 1024;
    com.azure.storage.common.ParallelTransferOptions parallelTransferOptions = new com.azure.storage.common.ParallelTransferOptions()
            .setBlockSizeLong(blockSize).setMaxConcurrency(2)
            .setProgressReceiver(new com.azure.storage.common.ProgressReceiver() {
                @Override
                public void reportProgress(long bytesTransferred) {
                    System.out.println("downloaded:" + bytesTransferred);
                }
            });

    BlobDownloadToFileOptions options = new BlobDownloadToFileOptions(filePath)
            .setParallelTransferOptions(parallelTransferOptions);
    blobClient.downloadToFileWithResponse(options, Duration.ofMinutes(30), null);
}

Memory footprint: (screenshot)

Result: (screenshot)

Stanley Gong
  • Thank you for the quick response! I tried your approach and it worked great. Here's a [screenshot](https://user-images.githubusercontent.com/76443987/102862430-c393cc80-4439-11eb-8eb2-62528ef47951.png) of heap usage during upload / download. I believe that the file-based API allows the Azure SDK to skip quite a few copying / buffering steps. The file is not corrupted. The only little inconvenience is that we receive our data from the network in the form of an `InputStream` and would have to write it to a temporary file in order to leverage the `uploadFromFile` API – white-sagittarius Dec 22 '20 at 08:05
  • By the way, I noticed that only 6 files are uploaded / downloaded in parallel. If I work with more than 6 files in parallel, the rest wait for the first 6 to complete. Do you happen to know if there is a setting to control this? – white-sagittarius Dec 22 '20 at 09:58
  • My assumption that there is some hidden limitation of up to 6 files uploaded / downloaded concurrently turned out to be wrong: it was due to Google Chrome, not the SDK. I was using Swagger to call the upload / download endpoints, and Chrome has a limit of 6 connections per host name and a maximum of 10 connections. – white-sagittarius Dec 23 '20 at 13:28
  • @white-sagittarius, thanks for the tip, I was not familiar with this before – Stanley Gong Dec 24 '20 at 07:45
  • @StanleyGong What does `setMaxConcurrency` do here? – Gaurav Jan 04 '21 at 17:06
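
Regarding the inconvenience mentioned in the first comment above (the data arrives as an InputStream, while uploadFromFile wants a path): one way to bridge that is to spool the stream to a temporary file and hand the file to the SDK. Below is a minimal sketch, assuming a BlobClient built as in the answer; the class and method names are illustrative, and it uses the simple two-argument uploadFromFile(path, overwrite) overload instead of the longer one shown in the answer:

import com.azure.storage.blob.BlobClient;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class StreamToBlobUploader {

    // Spools the incoming stream to a temp file, then lets the SDK do the chunked file upload.
    public static void upload(BlobClient blobClient, InputStream data) throws IOException {
        Path tempFile = Files.createTempFile("blob-upload-", ".tmp");
        try {
            // Copies with a small fixed buffer, so heap usage stays flat regardless of file size.
            Files.copy(data, tempFile, StandardCopyOption.REPLACE_EXISTING);
            blobClient.uploadFromFile(tempFile.toString(), true /* overwrite */);
        } finally {
            Files.deleteIfExists(tempFile);
        }
    }
}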