
Background: File upload

My scenario: I need to upload a large number of files to Azure blob storage, somewhere between 10,000 and 100,000 files, each sized 10 KB to 50 KB.

I used the solution from the discussion above, and the files upload quickly. However, because there are so many files, my application drives CPU usage very high, always at 100%... What's worse, in the next step I need to run hundreds of processes, each of which needs to upload 10,000 files or more. In my testing so far I have unfortunately hit many strange problems, such as "connection is closed" exceptions...

Do you have any ideas for decreasing the CPU usage of these Tasks?

Paul Zhou

3 Answers


The problem I see here is that you spin up so many threads that you overload the machine simply by having to manage all the queued threads, even though they don't technically all run at the same time. They consume RAM, and once RAM runs out they spill into swap space, which then brings the machine down in a blaze of non-glory.

I would use a queue (an Azure Queue, MSMQ, or System.Collections.Queue) to queue up all the work items, and use a limited number of threads that each process a file using the async methods described in your background link; when a thread finishes one item, it checks the queue for the next item and processes that. My recommendation is to use a non-memory (persistent) queue, as I explain below. The main benefit is saving RAM, so that your software doesn't crash or slow down because the queue is too big. A minimal sketch of this pattern appears after the step list below.

Parallel.ForEach and such are great time savers but can really ruin the performance of your machine when you are dealing with a lot of items - and if the machine ever goes down then you cannot recover from it unless you have a checkpoint somewhere. Using a persistent queue will allow you to properly manage not just machine resources but also where you are in the process.

You can then scale this across multiple machines by using a persistent queue such as MSMQ or, if in the cloud, Azure Queues. If you run a service that checks how big the Azure queue is, you can even bring up extra instances from time to time to reduce the load and then terminate the extra instances.
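As a rough illustration of that queue-length check (my own sketch, not code from this answer): it assumes the current Azure.Storage.Queues package and a hypothetical queue named "upload-jobs"; the messages-per-instance figure and the actual scale-out call depend entirely on your hosting model and are left as placeholders.

```csharp
using System;
using System.Threading.Tasks;
using Azure.Storage.Queues;

class QueueMonitor
{
    // Rough capacity assumption: how many queued uploads one worker instance should own.
    const int MessagesPerInstance = 10_000;

    static async Task Main()
    {
        // Placeholder connection string and queue name.
        var queue = new QueueClient("<storage-connection-string>", "upload-jobs");

        // ApproximateMessagesCount tells you how big the backlog currently is.
        var props = await queue.GetPropertiesAsync();
        int backlog = props.Value.ApproximateMessagesCount;

        int instancesNeeded = Math.Max(1, (backlog + MessagesPerInstance - 1) / MessagesPerInstance);
        Console.WriteLine($"Backlog: {backlog}, worker instances needed: {instancesNeeded}");

        // ScaleTo(instancesNeeded);  // placeholder: call whatever autoscale API your hosting model provides
    }
}
```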

This is the scenario that I would implement:

  • Use the standard ThreadPool size.
  • When you detect a new file/batch, submit it to the queue.
  • Have an event fire every time you insert a new item in the queue (if using a memory queue).
  • Have a process check the queue (if using a persistent queue).
  • If a new item is in the queue, first check whether you have space in the ThreadPool; if you don't, ignore it for now (use a PEEK approach so you don't remove the item). Add a worker to the ThreadPool if there is space.
  • The processing thread (which runs under the ThreadPool) should execute the upload and then check whether there is another item in the queue; if not, the thread dies, which is fine.
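Here is a minimal sketch of that scenario using an in-memory ConcurrentQueue and a fixed set of worker tasks. It assumes the current Azure.Storage.Blobs package (the 2012-era StorageClient API is different), and the connection string, container name, local folder, and "projectA/" prefix are placeholders; swap the ConcurrentQueue for MSMQ or an Azure Queue to get the persistent, recoverable variant described above.

```csharp
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using Azure.Storage.Blobs;

class QueuedUploader
{
    static async Task Main()
    {
        // Placeholder connection string and container name.
        var container = new BlobContainerClient("<storage-connection-string>", "testcontainer");

        // Queue up every work item first; queued strings are cheap compared to queued threads.
        var queue = new ConcurrentQueue<string>(
            Directory.EnumerateFiles(@"C:\data\projectA", "*.xml"));

        const int workerCount = 16;  // small, fixed degree of parallelism instead of one task per file

        // Each worker loops: dequeue the next path, upload it asynchronously, repeat until the queue is empty.
        var workers = Enumerable.Range(0, workerCount).Select(async _ =>
        {
            while (queue.TryDequeue(out var path))
            {
                var blobName = "projectA/" + Path.GetFileName(path);
                await container.GetBlobClient(blobName).UploadAsync(path, overwrite: true);
            }
        });

        await Task.WhenAll(workers);
    }
}
```

Because only workerCount tasks ever exist, the scheduler has a handful of tasks to juggle instead of tens of thousands.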

Using this method, you could run this with 1 machine or with 50,000; provided you use a persistent queue for anything more than 1 machine, you won't have any problems. Of course, make sure you do a proper job of handling duplicate items if you are using Azure Queues, as you could be handed a queued item that has already been given to another machine.

It's a simple, scalable approach, and if you use a persistent queue (even the file system) it can recover from failure. It will not, however, overload the machine by forcing it to manage a ThreadPool with 1 million+ items.

Hope this helps

Karell Ste-Marie

Simply go for a thread pool implementation with about 20 threads (because that's probably around what your network bandwidth can handle simultaneously). If each upload takes 2-3 seconds, the whole job will take around 4-5 hours, which is acceptable. Make sure you don't share storage or container instances between uploads; that may be what is causing the "connection is closed" errors.
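One way to get that cap of roughly 20 simultaneous uploads is a SemaphoreSlim gate, sketched below. It assumes the current Azure.Storage.Blobs package, whose clients are designed to be shared across tasks; with the 2012-era StorageClient library you would instead create a separate client per upload, as this answer suggests. The connection string, container name, and blob-name prefix are placeholders.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using Azure.Storage.Blobs;

class ThrottledUploader
{
    static async Task UploadAllAsync(IEnumerable<string> files)
    {
        // Placeholder connection string and container name.
        var container = new BlobContainerClient("<storage-connection-string>", "testcontainer");

        using var gate = new SemaphoreSlim(20);  // at most ~20 uploads in flight at once
        var tasks = new List<Task>();

        foreach (var path in files)
        {
            await gate.WaitAsync();              // wait for a free slot instead of queuing unbounded work
            tasks.Add(Task.Run(async () =>
            {
                try
                {
                    var blobName = "projectA/" + Path.GetFileName(path);
                    await container.GetBlobClient(blobName).UploadAsync(path, overwrite: true);
                }
                finally
                {
                    gate.Release();              // free the slot even if the upload throws
                }
            }));
        }

        await Task.WhenAll(tasks);
    }
}
```

Waiting on the semaphore before starting each task keeps the number of in-flight tasks bounded, instead of queuing all 100,000 of them up front.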

ahmet alp balkan
  • Thanks. Actually, all files are uploaded to the same container but into different folders, like "http://xxx.blob.core.windows.net/testcontainer/projectA/xxx.xml"; same container, different projects. You said we can use a thread pool to control task behavior, right? Could you please describe it in detail? – Paul Zhou Oct 15 '12 at 07:36
  • Don't you have the files locally? Or are you going to move files around inside Azure Storage? – ahmet alp balkan Oct 18 '12 at 23:22
  • There is a misunderstanding: I am uploading local files to Azure blob storage, and the blob URI will be like xxx.blob.core.windows.net/testcontainer/projectA/xxx.xml – Paul Zhou Oct 22 '12 at 01:07

I'm a Microsoft Technical Evangelist and I have developed a free sample tool (no support/no guarantee) to help in these scenarios.

The binaries and source-code are available here: https://blobtransferutility.codeplex.com/

The Blob Transfer Utility is a GUI tool to upload and download thousands of small/large files to/from Windows Azure Blob Storage.

Features:

  • Create batches to upload/download
  • Set the Content-Type
  • Transfer files in parallel
  • Split large files into smaller parts that are transferred in parallel

The 1st and 3rd features are the answer to your problem.

You can learn from the sample code how I did it, or you can simply run the tool and do what you need to do.

  • Thanks for posting your answer! Please be sure to read the [FAQ on Self-Promotion](http://stackoverflow.com/faq#promotion) carefully. Also note that it is *required* that you post a disclaimer every time you link to your own site/product. – Andrew Barber Apr 05 '13 at 20:36