
I am working on an application where file uploads happen often and can be pretty large.

Those files are being uploaded to a Web API, which will then get the Stream from the request and pass it on to my storage service, which then uploads it to Azure Blob Storage.

I need to make sure that:

  • No temp files are written on the Web API instance
  • The request stream is not fully read into memory before passing it on to the storage service (to prevent OutOfMemoryExceptions).

I've looked at this article, which describes how to disable input stream buffering, but because many file uploads from many different users happen simultaneously, it's important that it actually does what it says on the tin.
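
For context, a minimal sketch of one way to disable input buffering in web-hosted Web API, roughly what such articles describe (the article may differ in details; the POST-only check here is an assumption and you would probably narrow it to the upload route):

using System.Web;
using System.Web.Http.WebHost;

public class NoBufferPolicySelector : WebHostBufferPolicySelector
{
    public override bool UseBufferedInputStream(object hostContext)
    {
        // Don't buffer the request body for uploads; hand the controller the raw stream.
        var context = hostContext as HttpContextBase;
        if (context != null && context.Request.HttpMethod == "POST")
        {
            return false;
        }

        return base.UseBufferedInputStream(hostContext);
    }
}

// Registered at startup, e.g. in WebApiConfig.Register:
// config.Services.Replace(typeof(IHostBufferPolicySelector), new NoBufferPolicySelector());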

This is what I have in my controller at the moment:

if (this.Request.Content.IsMimeMultipartContent())
{
    var provider = new MultipartMemoryStreamProvider();
    await this.Request.Content.ReadAsMultipartAsync(provider);
    var fileContent = provider.Contents.SingleOrDefault();

    if (fileContent == null)
    {
        throw new ArgumentException("No filename.");
    }

    var fileName = fileContent.Headers.ContentDisposition.FileName.Replace("\"", string.Empty);
    
    // I need to make sure this stream is ready to be processed by 
    // the Azure client lib, but not buffered fully, to prevent OoM.
    var stream = await fileContent.ReadAsStreamAsync();
}

I don't know how I can reliably test this.

EDIT: I forgot to mention that uploading directly to Blob Storage (circumventing my API) won't work, as I am doing some size checking (e.g. can this user upload 500 MB? Has this user used his quota?).

Jeff
  • Have you tried copying the input stream directly to the blob storage? – Yuval Itzchakov May 04 '15 at 13:33
  • That's what I am doing, but I need to make sure that I am not fully buffering the input stream before the blob storage client starts uploading, and I don't know how to test that it's actually happening. – Jeff May 04 '15 at 13:35
  • Have you tried profiling your app to see if it's buffering it before the read? – Yuval Itzchakov May 04 '15 at 13:41
  • Get a [memory profiler](http://stackoverflow.com/questions/399847/net-memory-profiling-tools) and test your app. – Yuval Itzchakov May 04 '15 at 13:52
  • I've found that the file is indeed copied to memory before sending it off to Azure. This is a problem. – Jeff May 04 '15 at 15:13

2 Answers


Solved it, with the help of this Gist.

Here's how I am using it, along with a clever "hack" to get the actual file size, without copying the file into memory first. Oh, and it's twice as fast (obviously).

// Create an instance of our provider.
// See https://gist.github.com/JamesRandall/11088079#file-blobstoragemultipartstreamprovider-cs for implementation.
var provider = new BlobStorageMultipartStreamProvider();

// This is where the uploading is happening, by writing to the Azure stream
// as the file stream from the request is being read, leaving almost no memory footprint.
await this.Request.Content.ReadAsMultipartAsync(provider);

// We want to know the exact size of the file, but this info is not available to us before
// we've uploaded everything - which has just happened.
// We get the stream from the content (and that stream is the same instance we wrote to).
var stream = await provider.Contents.First().ReadAsStreamAsync();

// Problem: If you try to use stream.Length, you'll get an exception, because BlobWriteStream
// does not support it.

// But this is where we get fancy.

// Position == size, because the file has just been written to it, leaving the
// position at the end of the file.
var sizeInBytes = stream.Position;

Voilà, you've got your uploaded file's size without having to copy the file into your web instance's memory.
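
For reference, the provider from that Gist is roughly this shape (a sketch rather than the exact Gist code; the "uploads" container name and the "StorageConnectionString" lookup are placeholders):

using System.Configuration;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using Microsoft.WindowsAzure.Storage;

public class BlobStorageMultipartStreamProvider : MultipartStreamProvider
{
    public override Stream GetStream(HttpContent parent, HttpContentHeaders headers)
    {
        var contentDisposition = headers.ContentDisposition;
        if (contentDisposition != null && !string.IsNullOrWhiteSpace(contentDisposition.FileName))
        {
            // File part: return a writable blob stream, so ReadAsMultipartAsync copies
            // the request body straight into Blob Storage as it is read.
            var fileName = contentDisposition.FileName.Trim('"');
            var account = CloudStorageAccount.Parse(
                ConfigurationManager.ConnectionStrings["StorageConnectionString"].ConnectionString);
            var container = account.CreateCloudBlobClient().GetContainerReference("uploads");
            return container.GetBlockBlobReference(fileName).OpenWrite();
        }

        // Ordinary form fields are small; buffering them in memory is fine.
        return new MemoryStream();
    }
}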

As for getting the file length before the file is uploaded, that's not as easy, and I had to resort to some rather unpleasant methods to get even an approximation.

In the BlobStorageMultipartStreamProvider:

var approxSize = parent.Headers.ContentLength.Value - parent.Headers.ToString().Length;

This gives me a pretty close file size, off by a few hundred bytes (it depends on the HTTP headers, I guess). This is good enough for me, as my quota enforcement can accept a few bytes being shaved off.
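
If you want to enforce a quota with that approximation, the natural place is inside the provider's GetStream, before handing back the blob stream; a rough sketch (remainingQuotaInBytes is a hypothetical per-user lookup):

// Inside BlobStorageMultipartStreamProvider.GetStream, before returning the blob stream:
var approxSize = parent.Headers.ContentLength.Value - parent.Headers.ToString().Length;
if (approxSize > remainingQuotaInBytes)
{
    // Bail out before any bytes are written to Blob Storage; the controller's
    // ReadAsMultipartAsync call will fault and can translate this into a 413 response.
    throw new InvalidOperationException("Upload would exceed the user's remaining quota.");
}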

Just for showing off, here's the memory footprint, reported by the insanely accurate and advanced Performance Tab in Task Manager.

Before: using MemoryStream, reading it into memory before uploading (screenshot).

After: writing directly to Blob Storage (screenshot).

Jeff

I think a better approach is for you to go directly to Azure Blob Storage from your client. By leveraging the CORS support in Azure Storage, you eliminate load on your Web API server, resulting in better overall scale for your application.

Basically, you will create a Shared Access Signature (SAS) URL that your client can use to upload the file directly to Azure storage. For security reasons, it is recommended that you limit the time period for which the SAS is valid. Best practices guidance for generating the SAS URL is available here.

For your specific scenario, check out this blog from the Azure Storage team, where they discuss using CORS and SAS for exactly this case. There is also a sample application, so this should give you everything you need.
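
A rough sketch of the server-side piece of that approach, using the classic Microsoft.WindowsAzure.Storage client (the helper name, container name, blob naming, and 15-minute expiry are all assumptions):

using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

public static class SasUrlFactory
{
    // Issues a short-lived, write-only SAS URL for a single blob; the client uploads
    // directly to this URL and never streams the file through the Web API.
    public static string CreateUploadUrl(string connectionString)
    {
        var account = CloudStorageAccount.Parse(connectionString);
        var container = account.CreateCloudBlobClient().GetContainerReference("uploads");
        var blob = container.GetBlockBlobReference(Guid.NewGuid().ToString());

        var sas = blob.GetSharedAccessSignature(new SharedAccessBlobPolicy
        {
            Permissions = SharedAccessBlobPermissions.Write,
            SharedAccessExpiryTime = DateTimeOffset.UtcNow.AddMinutes(15)
        });

        return blob.Uri.AbsoluteUri + sas;
    }
}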

Rick Rainey
  • You can still use this solution. In your method that generates the SAS URL you can also return back any data quotas you are maintaining for the user such as how much storage is left. In your JavaScript, add some logic to see if your byte array is larger than the quota you returned back for the user and if so show an error on the client. – Rick Rainey May 04 '15 at 14:42
  • That's a problem because 3rd parties will be integrating with my API, so nothing is stopping them from ignoring the quotas. Never trust the client. :) – Jeff May 04 '15 at 14:44
  • Yes, but you don't incur any ingestion costs for this and the storage costs are super cheap. So, I would recommend validating this on the server side as well. – Rick Rainey May 04 '15 at 14:46
  • I can't validate it server-side because the stream does not touch my server. Another thing is that I am storing a "reference" to the file in a SQL database. It'll have to go through my server. – Jeff May 04 '15 at 14:48
  • I'm thinking of a background job to do this. Somewhere you are already doing this because you know what the quota for the user is. You could drop a message into a queue each time a user uploads a file to kick off a job that checks the user's quota. Anyway, just a few ideas to think about. – Rick Rainey May 04 '15 at 14:51
  • So basically send a pre-upload request, telling the system that a file is being uploaded, whereafter the client must upload it directly to storage? It would probably work, but could produce inconsistencies (Azure failing, bad requests, etc). – Jeff May 04 '15 at 14:53
  • Essentially. Yes, you will have to add some additional logic but it wouldn't be too much effort. Since you said these uploads are frequent and large, IMO, it's worth the effort to relocate the load that you would otherwise be putting on your Web API server. – Rick Rainey May 04 '15 at 14:57
  • "Frequent and large" is a worst-case scenario. How would I deal with inconsistencies though? And, again, how do I deal with clients that upload more than what the quota allows? I don't see how I can enforce this. Storage may be super cheap, if you're financially backed. :) – Jeff May 04 '15 at 14:59
  • Have you looked at the pricing for Azure storage? Even if a user circumvented your JS logic and uploaded a few hundred MBs of data beyond the quota, you're only talking about a few pennies. On the server side you would validate the data uploaded (file reference), update your user's quota, and not allow any further uploads if it was exceeded. If you wanted your pennies back, you could delete the blob that exceeded the user's quota so you're not paying for the storage. – Rick Rainey May 04 '15 at 15:13
  • How will the API know when the upload is completed? I am sending out a SignalR message when that happens. – Jeff May 04 '15 at 15:16
  • I have to agree with Rick here. I've moved to direct uploading to Azure storage recently and it's been a massive improvement. First I get the user to add the file via an input. Then I make an API call to my server with the file size. If their quota is OK, I return a SAS URI. Then I upload the file in chunks and in parallel through Cloudfront CDN direct to Azure storage. Fastest possible experience and less code. – GFoley83 Dec 20 '15 at 01:43