
We are developing a document microservice that needs to use Azure as storage for file content. Azure Block Blob seemed like a reasonable choice. The document service has its heap limited to 512 MB (-Xmx512m).

I was not successful in getting a streaming file upload to work within the limited heap using azure-storage-blob:12.10.0-beta.1 (also tested on 12.9.0).

The following approaches were attempted:

  1. Copy-pasting from the documentation using BlockBlobClient:
BlockBlobClient blockBlobClient = blobContainerClient.getBlobClient("file").getBlockBlobClient();

File file = new File("file");

try (InputStream dataStream = new FileInputStream(file)) {
  blockBlobClient.upload(dataStream, file.length(), true /* overwrite file */);
}

Result: java.io.IOException: mark/reset not supported. The SDK tries to use mark/reset even though FileInputStream reports that it does not support this feature.
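
For reference, the mismatch is easy to see in isolation: FileInputStream inherits the default markSupported() implementation, which returns false, while BufferedInputStream overrides it to return true. A minimal check (the file name is illustrative):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkSupportCheck {
  public static void main(String[] args) throws IOException {
    try (InputStream plain = new FileInputStream("file");
         InputStream buffered = new BufferedInputStream(new FileInputStream("file"))) {
      // FileInputStream does not support mark/reset, so any reset() call on it fails
      System.out.println("FileInputStream markSupported: " + plain.markSupported());        // false
      System.out.println("BufferedInputStream markSupported: " + buffered.markSupported()); // true
    }
  }
}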

  2. Adding a BufferedInputStream to mitigate the mark/reset issue (per advice):
BlockBlobClient blockBlobClient = blobContainerClient.getBlobClient("file").getBlockBlobClient();

File file = new File("file");

try (InputStream dataStream = new BufferedInputStream(new FileInputStream(file))) {
  blockBlobClient.upload(dataStream, file.length(), true /* overwrite file */);
}

Result: java.lang.OutOfMemoryError: Java heap space. I assume that the SDK attempted to load all 1.17 GB of file content into memory.
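
This would be consistent with how BufferedInputStream behaves if something calls mark() with a very large read limit: the stream keeps growing its internal buffer so that reset() can rewind, effectively retaining everything it has read. The snippet below is only a standalone illustration of that JDK behaviour, not necessarily the SDK's actual code path; with a 1.17 GB file and -Xmx512m it fails the same way:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkBufferingDemo {
  public static void main(String[] args) throws IOException {
    try (InputStream is = new BufferedInputStream(new FileInputStream("file"))) {
      // A mark with a huge read limit forces BufferedInputStream to keep every byte
      // read afterwards in its internal byte[] so that reset() could rewind.
      is.mark(Integer.MAX_VALUE);
      byte[] chunk = new byte[8192];
      long total = 0;
      int n;
      while ((n = is.read(chunk)) != -1) {
        total += n; // ends in java.lang.OutOfMemoryError long before the file is fully read
      }
      System.out.println("read " + total + " bytes");
    }
  }
}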

  3. Replacing BlockBlobClient with BlobClient and removing the heap size limitation (-Xmx512m):
BlobClient blobClient = blobContainerClient.getBlobClient("file");

File file = new File("file");

try (InputStream dataStream = new FileInputStream(file)) {
  blobClient.upload(dataStream, file.length(), true /* overwrite file */);
}

Result: 1.5 GB of heap memory used; all file content is loaded into memory, plus some additional buffering on the Reactor side.

Heap usage from VisualVM

  4. Switching to streaming via BlobOutputStream:
long blockSize = DataSize.ofMegabytes(4L).toBytes();

BlockBlobClient blockBlobClient = blobContainerClient.getBlobClient("file").getBlockBlobClient();

// create / erase blob
blockBlobClient.commitBlockList(List.of(), true);

BlockBlobOutputStreamOptions options = (new BlockBlobOutputStreamOptions()).setParallelTransferOptions(
  (new ParallelTransferOptions()).setBlockSizeLong(blockSize).setMaxConcurrency(1).setMaxSingleUploadSizeLong(blockSize));

try (InputStream is = new FileInputStream("file")) {
  try (OutputStream os = blockBlobClient.getBlobOutputStream(options)) {
    IOUtils.copy(is, os); // uses 8KB buffer
  }
}

Result: the file is corrupted during upload. The Azure web portal shows 1.09 GB instead of the expected 1.17 GB. Manually downloading the file from the Azure web portal confirms that the file content was corrupted during upload. The memory footprint decreased significantly, but the file corruption is a showstopper.
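
Beyond eyeballing sizes in the portal, a byte-level comparison against a locally downloaded copy makes the corruption unambiguous. A minimal sketch that streams both files through SHA-256, so it also fits in the 512 MB heap (the paths are illustrative):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class BlobIntegrityCheck {
  public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
    Path original = Paths.get("file");               // local source file
    Path downloaded = Paths.get("file-from-azure");  // copy downloaded from the Azure web portal

    System.out.println("size matches:    " + (Files.size(original) == Files.size(downloaded)));
    System.out.println("content matches: " + MessageDigest.isEqual(sha256(original), sha256(downloaded)));
  }

  private static byte[] sha256(Path path) throws IOException, NoSuchAlgorithmException {
    MessageDigest digest = MessageDigest.getInstance("SHA-256");
    try (InputStream is = Files.newInputStream(path)) {
      byte[] buffer = new byte[8192];
      int n;
      while ((n = is.read(buffer)) != -1) {
        digest.update(buffer, 0, n); // stream the file instead of loading it into memory
      }
    }
    return digest.digest();
  }
}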

Problem: I cannot come up with a working upload / download solution with a small memory footprint.

Any help would be greatly appreciated!

  • "file is corrupted during upload. Azure web portal shows 1.09GB instead of expected 1.17GB", could it be that Azure web portal shows [Gibibyte / GiB](https://en.wikipedia.org/wiki/Byte#Multiple-byte_units) (i.e. 1024³ bytes) instead of Gigabyte (i.e. 1000³ bytes)? Because 1.17 GB ≈ 1.09 GiB. (Though if you confirmed locally that the uploaded file is corrupt, then that might not be the answer) – Marcono1234 Dec 21 '20 at 17:23
  • @Marcono1234 yes, I verified that the byte sizes of the original and uploaded files are the same, so by all means you are right. However, the file itself is corrupted (e.g. an uploaded image has 60% grey pixels, an uploaded video is not playable). I used this fragment to verify that the byte size is the same (as well as checking it manually via download & compare): `log.info("EXPECTED SIZE: {}; ACTUAL SIZE: {}", image.length(), blockBlobClient.getProperties().getBlobSize());` I created this issue on GitHub as a sanity check: https://github.com/Azure/azure-sdk-for-java/issues/18295 – white-sagittarius Dec 21 '20 at 19:04

1 Answer


Please try the code below to upload / download big files. I have tested it on my side using a .zip file of about 1.1 GB.

For uploading files:

public static void uploadFilesByChunk() {
    String connString = "<conn str>";
    String containerName = "<container name>";
    String blobName = "UploadOne.zip";
    String filePath = "D:/temp/" + blobName;

    BlobServiceClient client = new BlobServiceClientBuilder().connectionString(connString).buildClient();
    BlobClient blobClient = client.getBlobContainerClient(containerName).getBlobClient(blobName);
    long blockSize = 2 * 1024 * 1024; // 2 MB
    ParallelTransferOptions parallelTransferOptions = new ParallelTransferOptions()
            .setBlockSizeLong(blockSize).setMaxConcurrency(2)
            .setProgressReceiver(new ProgressReceiver() {
                @Override
                public void reportProgress(long bytesTransferred) {
                    System.out.println("uploaded:" + bytesTransferred);
                }
            });

    BlobHttpHeaders headers = new BlobHttpHeaders().setContentLanguage("en-US").setContentType("binary");

    blobClient.uploadFromFile(filePath, parallelTransferOptions, headers, null, AccessTier.HOT,
            new BlobRequestConditions(), Duration.ofMinutes(30));
}

Memory footprint: (screenshot)

For downloading files:

public static void downLoadFilesByChunk() {
    String connString = "<conn str>";
    String containerName = "<container name>";
    String blobName = "UploadOne.zip";

    String filePath = "D:/temp/" + "DownloadOne.zip";

    BlobServiceClient client = new BlobServiceClientBuilder().connectionString(connString).buildClient();
    BlobClient blobClient = client.getBlobContainerClient(containerName).getBlobClient(blobName);
    long blockSize = 2 * 1024 * 1024;
    com.azure.storage.common.ParallelTransferOptions parallelTransferOptions = new com.azure.storage.common.ParallelTransferOptions()
            .setBlockSizeLong(blockSize).setMaxConcurrency(2)
            .setProgressReceiver(new com.azure.storage.common.ProgressReceiver() {
                @Override
                public void reportProgress(long bytesTransferred) {
                    System.out.println("downloaded:" + bytesTransferred);
                }
            });

    BlobDownloadToFileOptions options = new BlobDownloadToFileOptions(filePath)
            .setParallelTransferOptions(parallelTransferOptions);
    blobClient.downloadToFileWithResponse(options, Duration.ofMinutes(30), null);
}

Memory footprint: (screenshot)

Result: (screenshot)

Stanley Gong
  • Thank you for the quick response! I tried your approach and it worked great. Here's a [screenshot](https://user-images.githubusercontent.com/76443987/102862430-c393cc80-4439-11eb-8eb2-62528ef47951.png) of heap usage during upload / download. I believe that the file-based API allows the Azure SDK to skip quite a few copying / buffering steps. The file is not corrupted. The only little inconvenience is that we receive our data from the network in the form of an `InputStream` and would have to write it to a temporary file in order to leverage the `uploadFromFile` API – white-sagittarius Dec 22 '20 at 08:05
  • By the way, I noticed that only 6 files are uploaded / downloaded in parallel. If I work with more than 6 files in parallel, the rest wait for the first 6 to complete. Do you happen to know if there is a setting to control this? – white-sagittarius Dec 22 '20 at 09:58
  • My assumption that there is some hidden limitation of up to 6 files uploaded / downloaded concurrently turned out to be wrong: it was due to Google Chrome, not the SDK. I was using Swagger to call the upload / download endpoints, and Chrome has a limit of 6 connections per host name and a maximum of 10 connections. – white-sagittarius Dec 23 '20 at 13:28
  • @white-sagittarius, thanks for the tip, I was not familiar with this before – Stanley Gong Dec 24 '20 at 07:45
  • @StanleyGong What does `setMaxConcurrency` do here? – Gaurav Jan 04 '21 at 17:06
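
Regarding the inconvenience mentioned in the first comment above (the data arrives as an InputStream, while uploadFromFile wants a path): one way to bridge that is to spool the stream to a temporary file and hand the file to the SDK. Below is a minimal sketch, assuming a BlobClient built as in the answer; the class and method names are illustrative, and it uses the simple two-argument uploadFromFile(path, overwrite) overload instead of the longer one shown in the answer:

import com.azure.storage.blob.BlobClient;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class StreamToBlobUploader {

    // Spools the incoming stream to a temp file, then lets the SDK do the chunked file upload.
    public static void upload(BlobClient blobClient, InputStream data) throws IOException {
        Path tempFile = Files.createTempFile("blob-upload-", ".tmp");
        try {
            // Copies with a small fixed buffer, so heap usage stays flat regardless of file size.
            Files.copy(data, tempFile, StandardCopyOption.REPLACE_EXISTING);
            blobClient.uploadFromFile(tempFile.toString(), true /* overwrite */);
        } finally {
            Files.deleteIfExists(tempFile);
        }
    }
}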