
I'm working on a project where we're receiving data from multiple sources that needs to be saved into various tables in our database.

Fast.

I've played with various methods, and the fastest I've found so far is using a collection of table-valued parameters, filling them up and periodically sending them to the database via a corresponding collection of stored procedures.
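
Roughly, each batch goes to its table like this (a minimal sketch only: `dbo.MyRowType` and `dbo.InsertMyRows` are illustrative names rather than the real schema, and `connection` is an open SqlConnection):

// Sketch only: requires System.Data and System.Data.SqlClient; names are illustrative.
var batch = new DataTable();
batch.Columns.Add("Id", typeof(int));
batch.Columns.Add("Payload", typeof(string));
// ... rows are accumulated here as data arrives ...

using (var cmd = new SqlCommand("dbo.InsertMyRows", connection))
{
    cmd.CommandType = CommandType.StoredProcedure;
    var p = cmd.Parameters.AddWithValue("@Rows", batch);
    p.SqlDbType = SqlDbType.Structured;   // table-valued parameter
    p.TypeName = "dbo.MyRowType";         // user-defined table type on the server
    cmd.ExecuteNonQuery();                // the proc inserts from @Rows into the target table
}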

The results are quite satisfying. However, looking at disk usage (% Idle Time in Perfmon), I can see that the disk is getting periodically 'thrashed' (a spike down to 0% every 13-18 seconds), whilst in between % Idle Time is around 90%. I've tried varying the 'batch' size, but it doesn't have an enormous influence.

  1. Should I be able to get better throughput by (somehow) avoiding the spikes while decreasing the overall idle time?
  2. What are some things I should be looking at to work out where the spiking is happening? (The database is in Simple recovery mode, and pre-sized to 'big', so it's not the log file growing)
  3. Bonus: I've seen other questions referring to 'streaming' data into the database, but this seems to involve having a Stream from another database (last section here). Is there any way I could shoe-horn 'pushed' data into that?

[Perfmon screenshot: % Idle Time trace showing the periodic spikes]

– Benjol
3 Answers


A very easy way of inserting loads of data into SQL Server is, as mentioned, the 'bulk insert' method. ADO.NET offers a very easy way of doing this without the need for external files. Here's the code:

// SqlBulkCopy streams rows to the server; a DataTable (not a DataSet) is a valid source.
using (var bulkCopy = new SqlBulkCopy(myConnection))
{
    bulkCopy.DestinationTableName = "MyTable";
    bulkCopy.WriteToServer(myDataTable);
}

That's easy.

But: myDataTable needs to have exactly the same structure as MyTable, i.e. column names, field types and the order of fields must be exactly the same. If not, well, there's a solution to that: column mapping. And this is even easier to do:

bulkCopy.ColumnMappings.Add("ColumnNameOfDataSet", "ColumnNameOfTable");

That's still easy.

But: myDataTable needs to fit into memory. If not, things become a bit more tricky, as we need an IDataReader derivative that can be instantiated over an IEnumerable.
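
One ready-made example of such an adapter (not from the article, just one common option) is `ObjectReader` from the FastMember NuGet package, which exposes an IEnumerable<T> as a data reader. A rough sketch, assuming a Person class with Name and DateOfBirth properties and an open connection:

// Not from the article: FastMember's ObjectReader wraps an IEnumerable<T> as a DbDataReader,
// so the data never has to be materialised in memory. 'people' and the column names are illustrative.
using FastMember;

using (var bulkCopy = new SqlBulkCopy(myConnection))
using (var reader = ObjectReader.Create(people, "Name", "DateOfBirth"))
{
    bulkCopy.DestinationTableName = "Person";
    bulkCopy.WriteToServer(reader);   // rows are pulled lazily from 'people' as they are written
}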

You might get all the information you need in this article.

– alzaimar
  • The thing is, if my data has not all arrived yet, I can't put it into memory, even if it *would* fit. Could I use the IDataReader with a (potentially) blocking IEnumerable? – Benjol Oct 09 '13 at 09:32
  • Of course, but when it arrives, you have a bottleneck, and the bulk insert widens it quite a bit. What exactly do you mean by 'blocking IEnumerable'? You can build a 'data pump' if that is what you want: you put data in one end and it is transferred to the server the fastest possible way. If you add a queue, there won't be any overflow or blocking issues. – alzaimar Oct 09 '13 at 09:53
  • Maybe I'm not understanding properly, or failing to communicate properly. What I mean by 'blocking' is that if the bulk insert *is* fast enough, eventually it will do a `MoveNext` and have to wait, because the data won't have arrived yet. I thought that might be a problem, but maybe it's not. I see it as a conflict between the BulkInsert wanting to 'pull' the data, and my data sources wanting to 'push' it. – Benjol Oct 09 '13 at 10:24
  • That is why you would implement an input queue where data gets filled up. As soon as it gets filled up, the bulk insert starts. You'd probably use two threads for that. I've done something similar already and can guarantee that there is no blocking of the input data. But, you have to do some programming yourself. The push/pull problem (yes, you're right) can be solved using the classes introduced in the article I mentioned. – alzaimar Oct 09 '13 at 13:20
  • Just one more question (I hope), I'm having difficulty working out what kind of queue *would* 'block' the IEnumerable when empty, and would allow me to add new elements while the BulkCopy is enumerating. – Benjol Oct 16 '13 at 05:37
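
Regarding the last comment: `BlockingCollection<T>` behaves exactly that way. Its `GetConsumingEnumerable()` blocks on `MoveNext` while the collection is empty, allows items to be added concurrently from other threads, and completes once `CompleteAdding()` is called. A rough sketch (ObjectDataReader<T> is assumed to be the IEnumerable-backed reader from the article's code download; other names are illustrative):

// Rough sketch: a blocking queue between 'pushed' data and SqlBulkCopy's 'pull'.
// Requires System.Collections.Concurrent; ObjectDataReader<T> comes from the article's code.
var queue = new BlockingCollection<Person>(boundedCapacity: 10000);

// Consumer: the bulk copy pulls from the queue; MoveNext blocks while the queue is empty.
var copyTask = Task.Factory.StartNew(() =>
{
    using (var bulkCopy = new SqlBulkCopy(connection))
    using (var reader = new ObjectDataReader<Person>(queue.GetConsumingEnumerable()))
    {
        bulkCopy.DestinationTableName = "Person";
        bulkCopy.ColumnMappings.Add("Name", "Name");
        bulkCopy.ColumnMappings.Add("DateOfBirth", "DateOfBirth");
        bulkCopy.WriteToServer(reader);
    }
});

// Producers: add items as the data arrives, from any thread.
queue.Add(somePerson);
// ...
queue.CompleteAdding();   // lets the enumerable finish, so WriteToServer can return
copyTask.Wait();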

Building on the code referred to in alzaimar's answer, I've got a proof of concept working with IObservable (just to see if I can). It seems to work ok. I just need to put together some tidier code to see if this is actually any faster than what I already have.

(The following code only really makes sense in the context of the test program in the code download from the aforementioned article.)

Warning: NSFW, copy/paste at your peril!

private static void InsertDataUsingObservableBulkCopy(IEnumerable<Person> people, 
                                                      SqlConnection connection)
{
    // Rx subject: data is pushed in via OnNext below and pulled out by the bulk copy.
    var sub = new Subject<Person>();

    var bulkCopy = new SqlBulkCopy(connection);
    bulkCopy.DestinationTableName = "Person";
    bulkCopy.ColumnMappings.Add("Name", "Name");
    bulkCopy.ColumnMappings.Add("DateOfBirth", "DateOfBirth");

    // The reader pulls from the subject (which is fed below via OnNext), not directly from 'people'.
    using(var dataReader = new ObjectDataReader<Person>(sub.ToEnumerable()))
    {
        var task = Task.Factory.StartNew(() =>
        {
            bulkCopy.WriteToServer(dataReader);
        });
        var stopwatch = Stopwatch.StartNew();
        foreach(var person in people) sub.OnNext(person);
        sub.OnCompleted();
        task.Wait();
        Console.WriteLine("Observable Bulk copy: {0}ms",
                           stopwatch.ElapsedMilliseconds);
    }
}
– Benjol
  • This *IS* faster (for my use case), and seems to avoid the nasty disk spikes (there's a lot of activity on disk, but it's more evenly distributed). – Benjol Oct 16 '13 at 04:52

It's difficult to comment without knowing the specifics, but one of the fastest ways to get data into SQL Server is Bulk Insert from a file.

You could write the incoming data to a temp file and periodically bulk insert it.
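
For example, something along these lines (a rough sketch; the table name, file path and delimiters are illustrative, and the file must be readable by the SQL Server instance):

// Rough sketch: append incoming rows to a delimited temp file, then bulk insert it periodically.
var sql = @"BULK INSERT dbo.MyTable
            FROM 'C:\Temp\incoming_batch.csv'
            WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);";

using (var cmd = new SqlCommand(sql, connection))
{
    cmd.CommandTimeout = 0;   // bulk loads can exceed the default 30-second timeout
    cmd.ExecuteNonQuery();
}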

Streaming data into a SQL Server table-valued parameter also looks like a good solution for fast inserts, as the rows are held in memory. In answer to your question, yes, you could use this; you just need to turn your data into an IDataReader. There are various ways to do this, from a DataTable for example, see here.
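
Besides an IDataReader, a table-valued parameter can also be fed from an IEnumerable<SqlDataRecord>, which streams rows as they are enumerated rather than buffering them all first. A rough sketch (the table type dbo.PersonType, the procedure dbo.InsertPeople and the Person class are illustrative names):

// Rough sketch: streaming rows into a table-valued parameter without building a DataTable.
// Requires Microsoft.SqlServer.Server for SqlDataRecord/SqlMetaData; names are illustrative.
static IEnumerable<SqlDataRecord> ToRecords(IEnumerable<Person> people)
{
    var meta = new[]
    {
        new SqlMetaData("Name", SqlDbType.NVarChar, 100),
        new SqlMetaData("DateOfBirth", SqlDbType.DateTime)
    };
    foreach (var p in people)
    {
        var record = new SqlDataRecord(meta);
        record.SetString(0, p.Name);
        record.SetDateTime(1, p.DateOfBirth);
        yield return record;   // rows are produced on demand as the server reads them
    }
}

// Usage (inside a method): the parameter value is the lazy sequence itself.
using (var cmd = new SqlCommand("dbo.InsertPeople", connection))
{
    cmd.CommandType = CommandType.StoredProcedure;
    var p = cmd.Parameters.Add("@People", SqlDbType.Structured);
    p.TypeName = "dbo.PersonType";
    p.Value = ToRecords(people);
    cmd.ExecuteNonQuery();
}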

If your disk is a bottleneck you could always optimise your infrastructure. Put the database on a RAM disk or an SSD, for example.

– TheCodeKing
  • Yeah, I guess I should try that. What put me off is that the data is heterogeneous, so I'd have to have a different file for each type of data. That doesn't make it impossible, just a bit more fiddly. And it would also be nice to have the data available sooner than "when it's all finished"... – Benjol Oct 09 '13 at 07:04
  • Yeah it really depends on specifics, but I think you'll find Bulk Insert the fastest way to get large chunks of data into the database. Downside is you would buffer to file and periodically update rather than live update. – TheCodeKing Oct 09 '13 at 07:08
  • Just edited based on your link. It does look like the next best thing to bulk insert, and should work in your case. You'll still need to buffer, but at least there are no files, and it's less fiddly. – TheCodeKing Oct 09 '13 at 07:24
  • Note: You can directly insert data into an existing table; it is appended when using `SqlBulkCopy`. Indexes greatly decrease bulk insert performance, but deactivating and reactivating them also takes time, so you have to play around with this. @TheCodeKing: You can configure the `SqlBulkCopy` to buffer for you and send e.g. 5000 or 10000 rows at once (see the sketch after these comments). Using a typed `IDataReader` will also help increase speed. – alzaimar Oct 09 '13 at 07:59
  • @alzaimar, yeah I've already done quite a lot of 'optimisation' at sql level (clustering pks, dropping other constraints, using transactions, setting recovery mode to simple). Could you maybe expand on the `SqlBulkCopy` and `IDataReader` options in an answer? – Benjol Oct 09 '13 at 08:17
  • @alzaimar nice, I hadn't used `SqlBulkCopy` in anger. It looks like a nice alternative to `BulkInsert`. – TheCodeKing Oct 09 '13 at 19:49
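
A short sketch of the buffering settings mentioned in the comments above (values are illustrative; `connection` and `dataReader` are assumed to come from the earlier examples):

// Rough sketch: SqlBulkCopy batching/streaming settings (values are illustrative).
using (var bulkCopy = new SqlBulkCopy(connection))
{
    bulkCopy.DestinationTableName = "Person";
    bulkCopy.BatchSize = 5000;         // send rows to the server in batches of 5000
    bulkCopy.BulkCopyTimeout = 0;      // 0 = no timeout for long-running copies
    bulkCopy.EnableStreaming = true;   // pull from the IDataReader lazily instead of buffering it all
    bulkCopy.WriteToServer(dataReader);
}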