7

I have an Azure app in the cloud with a SQL Azure database. I have a worker role which needs to parse and process a file (up to ~30 million rows), so I can't directly use BCP or SSIS.

I'm currently using SqlBulkCopy (roughly as in the sketch below), but this seems too slow: I've seen load times of up to 4-5 minutes for 400k rows.
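
The load is essentially a plain SqlBulkCopy call along these lines; the table name, batch size, and connection handling are illustrative placeholders rather than my exact code:

    // Illustrative sketch only: table name, batch size, and connection handling
    // are placeholders.
    using System.Data;
    using System.Data.SqlClient;

    static class BulkLoad
    {
        public static void Load(DataTable rows, string connectionString)
        {
            using (var connection = new SqlConnection(connectionString))
            {
                connection.Open();
                using (var bulkCopy = new SqlBulkCopy(connection))
                {
                    bulkCopy.DestinationTableName = "dbo.MainTable"; // placeholder
                    bulkCopy.BatchSize = 10000;   // commit in chunks rather than one huge batch
                    bulkCopy.BulkCopyTimeout = 0; // no command timeout for large loads
                    bulkCopy.WriteToServer(rows);
                }
            }
        }
    }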

I want to run my bulk inserts in parallel; however, the articles I've read on importing data in parallel/controlling lock behaviour say that SqlBulkCopy requires the table to have no clustered index and a table lock (BU lock) to be specified. However, SQL Azure tables must have a clustered index...
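
As I understand those articles, the pattern they describe is roughly the sketch below, which relies on the target being a heap; the table name is a placeholder:

    // Sketch of the pattern the articles describe: each parallel loader asks for
    // the bulk update (BU) lock via SqlBulkCopyOptions.TableLock, which only
    // allows concurrent loads when the target has no clustered index.
    using System.Data;
    using System.Data.SqlClient;

    static class HeapBulkLoad
    {
        public static void LoadWithTableLock(DataTable rows, string connectionString)
        {
            using (var bulkCopy = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.TableLock))
            {
                bulkCopy.DestinationTableName = "dbo.StagingHeap"; // placeholder heap table
                bulkCopy.BulkCopyTimeout = 0;
                bulkCopy.WriteToServer(rows);
            }
        }
    }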

Is it even possible to use SqlBulkCopy in parallel on the same table in SQL Azure? If not, is there another API (that I can use in code) to do this?

kyliod

2 Answers

5

I don't see how you can run any faster than using SqlBulkCopy. On our project we can import 250K rows in about 3 mins, so your rate seems about right.

I don't think that doing it in parallel would help, even if it were technically possible. We only run one import at a time; otherwise SQL Azure starts timing out our requests.

In fact, sometimes running a large group-by query at the same time as the import isn't possible. SQL Azure does a lot of work to ensure quality of service; this includes timing out requests that take too long, use too many resources, etc.

So doing several large bulk inserts at the same time will probably cause one to time out.

Matt Warren
  • As Matt says, the throughput feels about right to me. Ensure that you have no indexes on your tables apart from the clustered index. – Chris J.T. Auld Mar 02 '12 at 18:24
  • I ended up inserting into temporary tables in parallel, and then doing an insert-into from those temporary tables into the main tables (in serial). That seemed much faster to me, as the insertion from the temporary tables took ~4-5 minutes for about 2 million rows. – kyliod Mar 05 '12 at 21:08
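
A rough sketch of the staging-table approach described in that comment; the table names are illustrative, and the staging tables are assumed to already exist with the main table's schema:

    // Each chunk is bulk copied into its own staging table in parallel, then a
    // serial INSERT ... SELECT moves the rows into the main table.
    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;
    using System.Threading.Tasks;

    static class StagingTableLoad
    {
        public static void Load(IList<DataTable> chunks, string connectionString)
        {
            // 1. Parallel bulk copies, one staging table per chunk.
            Parallel.For(0, chunks.Count, i =>
            {
                using (var bulkCopy = new SqlBulkCopy(connectionString))
                {
                    bulkCopy.DestinationTableName = "dbo.Staging_" + i; // placeholder
                    bulkCopy.BulkCopyTimeout = 0;
                    bulkCopy.WriteToServer(chunks[i]);
                }
            });

            // 2. Serial insert from each staging table into the main table.
            using (var connection = new SqlConnection(connectionString))
            {
                connection.Open();
                for (int i = 0; i < chunks.Count; i++)
                {
                    string sql = "INSERT INTO dbo.MainTable SELECT * FROM dbo.Staging_" + i + ";";
                    using (var command = new SqlCommand(sql, connection))
                    {
                        command.CommandTimeout = 0;
                        command.ExecuteNonQuery();
                    }
                }
            }
        }
    }
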
2

It is possible to run SqlBulkCopy in parallel against SQL Azure, even if you load the same table. You need to prepare your records in batches yourself before sending them to the SqlBulkCopy API. This will absolutely help with performance, and it allows you to retry a smaller batch of records when you get throttled for reasons outside of your control.
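
In outline, the approach looks something like the sketch below; the batch size, degree of parallelism, retry policy, and table name are illustrative choices rather than the exact values I use:

    // Split the rows into batches, load them against the same table in parallel,
    // and retry each batch independently if SQL Azure throttles the connection.
    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;
    using System.Threading;
    using System.Threading.Tasks;

    static class ParallelBulkLoad
    {
        public static void Load(DataTable allRows, string connectionString)
        {
            const int batchSize = 50000; // illustrative
            var batches = new List<DataTable>();

            // Split the source rows into independent batches.
            for (int start = 0; start < allRows.Rows.Count; start += batchSize)
            {
                DataTable batch = allRows.Clone(); // same schema, no rows
                int end = Math.Min(start + batchSize, allRows.Rows.Count);
                for (int i = start; i < end; i++)
                    batch.ImportRow(allRows.Rows[i]);
                batches.Add(batch);
            }

            // Load the batches in parallel.
            Parallel.ForEach(
                batches,
                new ParallelOptions { MaxDegreeOfParallelism = 4 },
                batch =>
                {
                    for (int attempt = 1; ; attempt++)
                    {
                        try
                        {
                            using (var bulkCopy = new SqlBulkCopy(connectionString))
                            {
                                bulkCopy.DestinationTableName = "dbo.MainTable"; // placeholder
                                bulkCopy.BulkCopyTimeout = 0;
                                bulkCopy.WriteToServer(batch);
                            }
                            return;
                        }
                        catch (SqlException) when (attempt < 5)
                        {
                            // Throttled or timed out: back off briefly, then retry this batch only.
                            Thread.Sleep(TimeSpan.FromSeconds(5 * attempt));
                        }
                    }
                });
        }
    }

Because each batch is loaded and retried independently, a throttled or dropped connection only costs you that one batch instead of the entire load.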

Take a look at my blog post comparing load times of various approaches. There is sample code as well. In separate tests I was able to cut the load time of a table in half.

This is the technique I am using for a couple of tools (Enzo Backup; Enzo Data Copy). It's not a simple thing to do, but when done properly you can optimize load times significantly.

Herve Roggero