
I'm trying to create a CSV file and upload it to an Azure storage account.

public static void ExportCsvToStorageAccount(string fileName, CloudBlobContainer container, IEnumerable<ReportRow> reportEntries)
{
    using (var ms = new MemoryStream())
    {
        using (var file = new StreamWriter(ms))
        {
            file.WriteLine("Date,StoreId,ItemId,SalesQuantity");

            foreach (var row in reportEntries)
            {
                var line = $"\"{row.Date}\",\"{row.StoreId}\",\"{row.ItemId}\",\"{row.SalesQuantity}\"";
                file.WriteLine(line);
            }

            // Flush the StreamWriter so everything it buffered reaches the MemoryStream before uploading
            file.Flush();

            var blockBlob = container.GetBlockBlobReference($"{fileName}.csv");
            ms.Position = 0;
            blockBlob.UploadFromStream(ms);
        }
    }
}

I'm creating the file in memory and then uploading it to Azure.

My "issue" is that this requires me to build the whole file in memory before I can start uploading (which can be a problem if the file is too big and the machine is low on RAM).

Ideally I could write directly into Azure, or, as soon as my memory stream buffer fills up, copy it to Azure and then write over it again instead of allocating more space in the memory stream buffer.

Is there a way to write directly into Azure? (The objective is to save RAM.)

Edit:

With the input of the answer by Gaurav Mantri-AIS I came up with this (because I have more than 50,000 entries, which is the block limit per blob):

public static void ExportCSVToStorageAccount(string fileName, CloudBlobContainer container, IEnumerable<RawReportRow> reportEntries)
{
    var blob = container.GetAppendBlobReference($"{fileName}.csv");
    blob.CreateOrReplace();

    blob.AppendText($"Date,StoreId,ItemId,SalesQuantity{Environment.NewLine}");
    foreach (var row in reportEntries)
    {
        var line = $"\"{row.Date}\",\"{row.StoreId}\",\"{row.ItemId}\",\"{row.SalesQuantity}\"{Environment.NewLine}";
        blob.AppendText(line);
    }
}

The issue with this solution is that it takes too long: from 5 minutes to more than an hour. I'm probably doing something wrong, as an append blob should perform well at appending, but that doesn't seem to be the case.

Any idea how to improve the write speed a little bit?

RagnaRock
  • Maybe the BlockBlobClient class? https://learn.microsoft.com/en-us/dotnet/api/azure.storage.blobs.specialized.blockblobclient?view=azure-dotnet – jefftrotman May 09 '20 at 15:02
  • There's a max limit to AppendText and the Append methods in general: it's 100MB per append. Also make sure to call blob.Flush at the end to actually perform the upload. – abhijat_saxena May 17 '20 at 13:26

2 Answers


It is certainly possible to do so. One solution would be to use a StringBuilder and keep adding data to it. Once all the data is added, create a byte array from it, then a memory stream from that, and upload the memory stream.

Here's the sample code (untested though):

    public static void ExportCsvToStorageAccount(string fileName, CloudBlobContainer container, IEnumerable<ReportRow> reportEntries)
    {
        using (var ms = new MemoryStream())
        {
            StringBuilder sb = new StringBuilder();
            sb.AppendLine("Date,StoreId,ItemId,SalesQuantity");
            foreach (var row in reportEntries)
            {
                var line = $"\"{row.Date}\",\"{row.StoreId}\",\"{row.ItemId}\",\"{row.SalesQuantity}\"";
                sb.AppendLine(line);
            }
            var buffer = Encoding.UTF8.GetBytes(sb.ToString());
            ms.Write(buffer, 0, buffer.Length);
            var blockBlob = container.GetBlockBlobReference($"{fileName}.csv");
            ms.Position = 0;
            blockBlob.UploadFromStream(ms);
        }
    }

UPDATE

Assuming you're using SDK version 9.3.3, you can use the UploadText method to upload the string directly to Azure Storage. Something like:

    public static void ExportCsvToStorageAccount(string fileName, CloudBlobContainer container, IEnumerable<ReportRow> reportEntries)
    {
        StringBuilder sb = new StringBuilder();
        sb.AppendLine("Date,StoreId,ItemId,SalesQuantity");
        foreach (var row in reportEntries)
        {
            var line = $"\"{row.Date}\",\"{row.StoreId}\",\"{row.ItemId}\",\"{row.SalesQuantity}\"";
            sb.AppendLine(line);
        }
        var blockBlob = container.GetBlockBlobReference($"{fileName}.csv");
        blockBlob.UploadText(sb.ToString());
    }

UPDATE 2

Yet another alternative would be to upload each line as a separate block and then finally commit the block list. However, please keep in mind that a blob can contain at most 50,000 blocks, so this method will fail if you have more than 50,000 records in your data. To circumvent this limitation you may want to combine several records and save them as a single block.

Here's the sample code:

    public static void ExportCsvToStorageAccount(string fileName, CloudBlobContainer container, IEnumerable<ReportRow> reportEntries)
    {
        List<string> blockIds = new List<string>();
        CloudBlockBlob blob = container.GetBlockBlobReference(fileName);
        int counter = 0;
        foreach (var row in reportEntries)
        {
            var line = $"\"{row.Date}\",\"{row.StoreId}\",\"{row.ItemId}\",\"{row.SalesQuantity}\"";
            var blockId = Convert.ToBase64String(Encoding.UTF8.GetBytes(counter.ToString("d6")));
            blob.PutBlock(blockId, new MemoryStream(Encoding.UTF8.GetBytes(line)), string.Empty);
            blockIds.Add(blockId);
            counter++;
        }
        blob.PutBlockList(blockIds);
    }
Gaurav Mantri
  • by looking at it, it seems it uses twice the memory, since the file is now stored in both the byte[] buffer and the MemoryStream ms – RagnaRock May 09 '20 at 16:14
  • This still makes us pre-build the whole file in the string builder first. If that UploadText method had the option to append instead of rewrite, I could write directly into azure, that would be the ideal scenario – RagnaRock May 11 '20 at 07:10
  • How large is your blob? There's another way but I believe that would be an over-engineered solution. – Gaurav Mantri May 11 '20 at 07:21
  • At the moment just 100 MB. I just want to save resources while saving it. Your solution is pointing me to use blob.AppendText(line); instead of UploadText(line). I'm using blob = container.GetAppendBlobReference($"{fileName}.csv") now and trying to write directly to it... but I'm getting an InvalidOperationException: Blob type of the blob reference doesn't match blob type of the blob. – RagnaRock May 11 '20 at 07:31
  • If you want to use append blobs instead of block blobs, then the above approach should work. But append blobs are quite different from block blobs, so you should be aware of the differences between them before using them. In any case, I've updated my answer with yet another approach. HTH. – Gaurav Mantri May 11 '20 at 07:43
  • With your newest solution I hit the 50,000 block limit, since I have around 100,000 entries in my report. So I tried this https://pastebin.com/91ASAjTz and it's ridiculously slow. I think both solutions are slow because the storage account is not prepared for so many blocks to be written or appended at once. There must be another kind of blob that allows this (I hope). – RagnaRock May 11 '20 at 12:11
  • I give up :). Please post the solution that you end up implementing. I would be very curious to know that. – Gaurav Mantri May 11 '20 at 12:14

I'm going to have a go, largely based on Gaurav Mantri-AIS' answer, because I think you guys are on to something.

Let's combine efforts here... On one hand, you would like to write to the blob as soon as possible, to limit memory usage. On the other hand, we don't want to write every line as its own block, since that would exceed the block limit. So we need to keep X records in memory before writing them to the blob as one block.

I'm trying some pseudo code here with X at a value of 50. I think this value can (and should) be optimized for memory usage, performance and block count:

public static void ExportCsvToStorageAccount(string fileName, CloudBlobContainer container, IEnumerable<ReportRow> reportEntries)
{
    List<string> blockIds = new List<string>();
    CloudBlockBlob blob = container.GetBlockBlobReference(fileName);
    int counter = 0;
    StringBuilder builder = new StringBuilder();
    foreach (var row in reportEntries)
    {
        builder.AppendLine($"\"{row.Date}\",\"{row.StoreId}\",\"{row.ItemId}\",\"{row.SalesQuantity}\"");
        counter++;

        if (counter % 50 == 0)
        {
            // Upload the batch of rows buffered so far as a single block
            var blockId = Convert.ToBase64String(Encoding.UTF8.GetBytes(counter.ToString("d6")));
            blob.PutBlock(blockId, new MemoryStream(Encoding.UTF8.GetBytes(builder.ToString())), string.Empty);
            blockIds.Add(blockId);
            builder = new StringBuilder();
        }
    }
    // Check if there's anything still in the StringBuilder and write it as a final block
    if (builder.Length != 0)
    {
        var blockId = Convert.ToBase64String(Encoding.UTF8.GetBytes(counter.ToString("d6")));
        blob.PutBlock(blockId, new MemoryStream(Encoding.UTF8.GetBytes(builder.ToString())), string.Empty);
        blockIds.Add(blockId);
    }
    // Commit the uploaded blocks in order to assemble the final blob
    blob.PutBlockList(blockIds);
}
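
For reference, calling it could look roughly like this (the connection string, container name and reportRows variable are placeholders, not part of the original question):

// Hypothetical setup; adjust the connection string and container name to your environment.
CloudStorageAccount account = CloudStorageAccount.Parse("<storage-connection-string>");
CloudBlobClient client = account.CreateCloudBlobClient();
CloudBlobContainer container = client.GetContainerReference("reports");
container.CreateIfNotExists();

// reportRows is the IEnumerable<ReportRow> produced elsewhere in the application.
ExportCsvToStorageAccount("daily-report.csv", container, reportRows);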

Another thing to take into account is the fact that Azure Function bindings for storage enable you to bind a blob to a Stream. This gives me two things to contemplate:

  • you could use an Azure Function for this (a rough sketch of such a binding follows below)
  • it should be possible to get a stream reference to a Blob
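
To illustrate the first point, here's a rough, untested sketch of what an Azure Function with a blob output binding bound to a Stream could look like (the function name, schedule and blob path are made up for the example):

using System.IO;
using Microsoft.Azure.WebJobs;

public static class ExportReportFunction
{
    [FunctionName("ExportReport")]
    public static void Run(
        // Hypothetical timer trigger; any trigger would do here.
        [TimerTrigger("0 0 * * * *")] TimerInfo timer,
        // The binding hands you a writable Stream that goes straight to the blob.
        [Blob("reports/report.csv", FileAccess.Write)] Stream output)
    {
        using (var writer = new StreamWriter(output))
        {
            writer.WriteLine("Date,StoreId,ItemId,SalesQuantity");
            // Write each report row here; the data is streamed to the blob as the buffer fills.
        }
    }
}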

EDIT:
I dove into the source for the azure-webjobs-sdk and found that it uses CloudBlobStream. Despite the fact that it is marked as legacy, you can still get a CloudBlobStream by calling OpenWriteAsync on a CloudBlockBlob. I didn't have time to test an example, but I did find this example here on SO: Uploading a file to Azure Blob on the fly.

public async Task<Stream> GetWriteStreamAsync(string storagePath, string contentType)
{
    var blockBlob = blobContainer.GetBlockBlobReference(storagePath);
    blockBlob.Properties.ContentType = contentType;
    CloudBlobStream bb = await blockBlob.OpenWriteAsync();
    return bb;
}
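
Building on that, a minimal sketch of how the export from the question could stream the CSV line by line (untested; it assumes the ReportRow type and container from the question and the WindowsAzure.Storage SDK):

// Requires Microsoft.WindowsAzure.Storage.Blob, System.IO and System.Threading.Tasks.
public static async Task ExportCsvToStorageAccountAsync(string fileName, CloudBlobContainer container, IEnumerable<ReportRow> reportEntries)
{
    CloudBlockBlob blockBlob = container.GetBlockBlobReference($"{fileName}.csv");

    // OpenWriteAsync returns a CloudBlobStream; the SDK uploads blocks as its internal
    // buffer fills, so the whole file never has to be held in memory at once.
    using (CloudBlobStream blobStream = await blockBlob.OpenWriteAsync())
    using (var writer = new StreamWriter(blobStream))
    {
        await writer.WriteLineAsync("Date,StoreId,ItemId,SalesQuantity");
        foreach (var row in reportEntries)
        {
            await writer.WriteLineAsync($"\"{row.Date}\",\"{row.StoreId}\",\"{row.ItemId}\",\"{row.SalesQuantity}\"");
        }
        // Disposing the writer flushes the remaining data and commits the blob.
    }
}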
rickvdbosch