
I am trying to convert this code to upload a file's blocks in parallel to improve upload times, but what I have tried so far has not produced any significant improvement. I want to stage the blocks side-by-side and then commit them. How can I do this in parallel?

public static async Task UploadInBlocks
    (BlobContainerClient blobContainerClient, string localFilePath, int blockSize)
{
    string fileName = Path.GetFileName(localFilePath);
    BlockBlobClient blobClient = blobContainerClient.GetBlockBlobClient(fileName);

    FileStream fileStream = File.OpenRead(localFilePath);

    ArrayList blockIDArrayList = new ArrayList();

    byte[] buffer;

    var bytesLeft = (fileStream.Length - fileStream.Position);

    while (bytesLeft > 0)
    {
        if (bytesLeft >= blockSize)
        {
            buffer = new byte[blockSize];
            await fileStream.ReadAsync(buffer, 0, blockSize);
        }
        else
        {
            buffer = new byte[bytesLeft];
            await fileStream.ReadAsync(buffer, 0, Convert.ToInt32(bytesLeft));
            bytesLeft = (fileStream.Length - fileStream.Position);
        }

        using (var stream = new MemoryStream(buffer))
        {
            string blockID = Convert.ToBase64String
                (Encoding.UTF8.GetBytes(Guid.NewGuid().ToString()));
            
            blockIDArrayList.Add(blockID);


            await blobClient.StageBlockAsync(blockID, stream);
        }

        bytesLeft = (fileStream.Length - fileStream.Position);

    }

    string[] blockIDArray = (string[])blockIDArrayList.ToArray(typeof(string));

    await blobClient.CommitBlockListAsync(blockIDArray);
}
John
    _"I am trying to convert this to parallel to improve the upload times of a file"_ - **it won't**: uploads are network-IO bound, but parallelization only benefits CPU-bound activities. Due to the overhead of concurrent network connections and transfers you're more likely to slow things down this way. – Dai Jan 24 '23 at 18:53
  • Your `while` loop is doing things slowly: you're reallocating large buffers inside a loop: _don't do that_ - and I think your code is also incorrect because you're not checking the return value of `fileStream.ReadAsync`. And you shouldn't be doing `new MemoryStream` inside a loop - nor doing convoluted things like `Convert.ToBase64String(Encoding.UTF8.GetBytes(Guid.NewGuid().ToString()))` - because `Guid.ToString()` returns Base16 digits which are already URI-safe, so the rigmarole with UTF8 bytes and Base64-encoding is just going to confuse people for no benefit. – Dai Jan 24 '23 at 18:56
  • Also, your entire code is... reinventing the wheel: `GetBlockBlobClient` can already directly upload a `FileStream` very efficiently - oh, and there's another bug in your code: `FileStream` is not _true async_ unless you use the `isAsync` ctor. And another bug: `var bytesLeft = (fileStream.Length - fileStream.Position);` will always be just `fileStream.Length` at start - and `Convert.ToInt32(bytesLeft)` will fail if you try to use a file sized larger than 2GB. – Dai Jan 24 '23 at 18:59
  • This question might be useful: [Parallel foreach with asynchronous lambda](https://stackoverflow.com/questions/15136542/parallel-foreach-with-asynchronous-lambda). – Theodor Zoulias Jan 24 '23 at 20:23
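For reference, if one still wanted to stage blocks concurrently despite the caveats above, the approach from the linked question could look roughly like this. This is a sketch only, assuming Azure.Storage.Blobs v12; the concurrency limit of 4 and the helper name `StageAsync` are arbitrary choices, not anything from the SDK:

```csharp
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Specialized;

public static async Task UploadInBlocksParallel(
    BlobContainerClient container, string localFilePath, int blockSize)
{
    BlockBlobClient blob = container.GetBlockBlobClient(Path.GetFileName(localFilePath));

    // useAsync: true so the FileStream performs true asynchronous I/O.
    await using var file = new FileStream(
        localFilePath, FileMode.Open, FileAccess.Read, FileShare.Read,
        bufferSize: blockSize, useAsync: true);

    var blockIds = new List<string>();
    var inFlight = new List<Task>();
    using var throttle = new SemaphoreSlim(4); // at most 4 concurrent stages

    while (true)
    {
        byte[] buffer = new byte[blockSize];
        int read = await file.ReadAsync(buffer, 0, blockSize);
        if (read == 0) break; // end of file

        // Block IDs must be Base64 and equal length; a GUID's 16 raw bytes
        // Base64-encode to a consistent 24 characters.
        string blockId = Convert.ToBase64String(Guid.NewGuid().ToByteArray());
        blockIds.Add(blockId); // keep original order for the commit

        await throttle.WaitAsync();
        inFlight.Add(StageAsync(blob, blockId, buffer, read, throttle));
    }

    await Task.WhenAll(inFlight);
    await blob.CommitBlockListAsync(blockIds);
}

private static async Task StageAsync(
    BlockBlobClient blob, string blockId, byte[] buffer, int count, SemaphoreSlim throttle)
{
    try
    {
        using var ms = new MemoryStream(buffer, 0, count);
        await blob.StageBlockAsync(blockId, ms);
    }
    finally
    {
        throttle.Release(); // free a slot even if the stage fails
    }
}
```

Note that the file is still read sequentially (so the stream position stays sane); only the network stages overlap, which is the part that spends its time waiting.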

1 Answer


Of course. You shouldn't expect any improvements - quite the opposite. Blob storage doesn't have any simplistic throughput throttling that would benefit from uploading in multiple streams, and you're already doing extremely light-weight I/O which is going to be entirely I/O bound.

Good I/O code gains nothing from parallelization. No matter how many workers you put on the job, the pipe is only so thick and will not let you push more data through.

All your code just reimplements the already very efficient mechanisms that the blob storage library has... and you do it considerably worse, with pointless allocation, wrong arguments and new opportunities for bugs. Don't do that. The library can deal with streams just fine.
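For completeness, a minimal sketch of letting the library do the chunking and concurrency itself (assuming Azure.Storage.Blobs v12; the block size and concurrency values below are arbitrary examples):

```csharp
using Azure.Storage;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

public static async Task UploadWithSdk(BlobContainerClient container, string localFilePath)
{
    BlobClient blob = container.GetBlobClient(Path.GetFileName(localFilePath));

    var options = new BlobUploadOptions
    {
        TransferOptions = new StorageTransferOptions
        {
            // The SDK splits the stream into blocks and stages them concurrently.
            MaximumConcurrency = 4,
            MaximumTransferSize = 8 * 1024 * 1024 // 8 MiB per block
        }
    };

    // useAsync: true so the FileStream performs true asynchronous I/O.
    await using var stream = new FileStream(
        localFilePath, FileMode.Open, FileAccess.Read, FileShare.Read,
        bufferSize: 81920, useAsync: true);

    await blob.UploadAsync(stream, options);
}
```

This is the same stage-and-commit mechanism the question hand-rolls, just handled inside the SDK.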

Luaan
    There are exceptions to this, however - while obviously concurrency isn't going to help when a pipe is already saturated, having multiple concurrent requests for things like Azure Blob storage blob enumeration is definitely significantly faster, because most of the client's time is spent waiting for Azure to return a response and things are load-balanced at their end (but client-side, those concurrent requests can all be made using a single thread if you know how to do it right). – Dai Jan 24 '23 at 19:33
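As an illustration of that last point, many concurrent requests can be issued from a single thread simply by starting the tasks before awaiting them. A generic sketch, not tied to the blob API (the URLs are placeholders):

```csharp
using System.Net.Http;

// One HttpClient, one thread: the requests overlap because each task
// spends almost all of its time awaiting a network response.
var client = new HttpClient();
string[] urls = { "https://example.com/a", "https://example.com/b" };

// Start every request first...
Task<string>[] inFlight = urls.Select(u => client.GetStringAsync(u)).ToArray();

// ...then await them together; total time is roughly that of the slowest one.
string[] bodies = await Task.WhenAll(inFlight);
```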