
I'm trying to figure out the best way to batch insert about 37k rows into my Sql Server using DAPPER.

My problem is that when I use Parallel.ForEach, the number of connections to the database climbs over a short period of time, finally hitting about 100 ... which gives connection pool errors. If I force the max degree of parallelism, it hits that max number and stays there.

Setting MaxDegreeOfParallelism feels wrong.

It currently is doing about 10-20 inserts a second. This is also in a simple Console App - so there's no other database activity besides what's happening in my Parallel.ForEach loop.

Is using Parallel.ForEach the wrong thing in this case, because this work is not CPU-bound?

Should I be using async/await? If so, what's stopping this from doing hundreds of db calls in one go?

Sample code, which is basically what I'm doing:

var items = GetItemsFromSomewhere(); // Returns 37K items.

Parallel.ForEach(items, item =>
{
    using (var sqlConnection = new SqlConnection(_connectionString))
    {
        var result = sqlConnection.Execute(myQuery, new { ... } );
    }
});

My (incorrect) understanding of this was that there should only be about 8 or so connections to the db at any time. The connection pool will release the connection (which remains instantiated in the pool, waiting to be used). And if the Execute takes .. I dunno .. let's say even 1 second (the longest running time for an insert was about 500ms, and that's 1 in every 100 or so) ... that's ok .. that thread is blocked and chills until the Execute completes. Then the scope completes (and Dispose is auto called) and the connection is closed. With the connection closed, the Parallel.ForEach then grabs the next item in the collection, goes to the connection pool and grabs a spare connection (remember - we just closed one, a split second ago) ... rinse, repeat.

Is this wrong?
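
For reference, here's roughly what I had in mind for the async/await version - just a sketch, where the SemaphoreSlim throttle (and the limit of 8) is only a guess at how to stop it firing hundreds of calls at once, and the query/parameters are elided as above:

var throttle = new SemaphoreSlim(8); // cap the number of concurrent db calls
var tasks = items.Select(async item =>
{
    await throttle.WaitAsync();
    try
    {
        using (var sqlConnection = new SqlConnection(_connectionString))
        {
            await sqlConnection.ExecuteAsync(myQuery, new { ... } );
        }
    }
    finally
    {
        throttle.Release();
    }
});
await Task.WhenAll(tasks);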

Notes:

  • .NET 4.5
  • Sql 2012
  • Console app.
  • Using Dapper.NET for sql code.
  • Using multiple connections makes blocking issues (much) worse, not better. Use SqlBulkCopy to insert entries in the database with minimal logging. Or use SQL Server's Integration Services to import the entries directly into the database, from any number of sources. 37K is a very small number of rows. – Panagiotis Kanavos Sep 28 '15 at 07:58
  • For the purpose of this question, let's ignore the actual SQL QUERY. I could use SSIS but I don't have access to that etc. I'm needing to use a console app to do all of this. Also, I'm trying to find out why my connection pool connections are slowly increasing over (a short amount of) time. – Pure.Krome Sep 28 '15 at 07:59
  • While Dapper is great for querying, I'd use the native Bulk Insert for that kind of job. – Alex Sep 28 '15 at 08:00
  • I'm also happy to accept answers that show non-Dapper solutions, btw - if I can do it in a console app. But I'm _very curious_ about the answer to my connection pooling problem. – Pure.Krome Sep 28 '15 at 08:01
  • `Parallel` is for parallelizing CPU tasks, not I/O. There's no reason to use multi-threading for SQL commands. – Luaan Sep 28 '15 at 08:02
  • @Luaan - I was hoping to say: I have lots of rows to insert. Go and do 10 inserts at the same time instead of 1 at a time. – Pure.Krome Sep 28 '15 at 08:03
  • The thing is, that's exactly the kind of thing that isn't well parallelizable even on the SQL Server itself - 10-20 rows per second is *horrible* performance - SQL server can easily do thousands (or tens of thousands). Ideally, you want to batch the inserts together, so that they execute as a single command (or e.g. 100 rows at a time; see the batching sketch after this comment thread). – Luaan Sep 28 '15 at 08:04
  • @Pure.Krome you misunderstand how the database works. 10 inserts in parallel means 10x the logging, and 10 more connections blocking on the same table. Parallel loading is important *ONLY* once you max out the performance of a single connection *AND* you have available disk & network performance. – Panagiotis Kanavos Sep 28 '15 at 08:04
  • You can improve your insert performance by 100-1000 times if you send 100-1000 rows at once with SqlBulkCopy. This will be minimally logged only once, and there *won't* be any lock waiting on the table or the underlying disk – Panagiotis Kanavos Sep 28 '15 at 08:06
  • That said, your understanding of connection pooling *is* correct - it shouldn't give you those errors. That means that somewhere, you're not disposing some of the objects (`SqlConnection`, `SqlDataReader`... who knows), or you're in some transaction scope. If you're in a distributed transaction, the connections cannot be reused until the transaction is over. – Luaan Sep 28 '15 at 08:06
  • To prevent head blocking and other associated locks, there's a side-effect to overuse of the pool: `When connection pooling is enabled, and if a timeout error or other login error occurs, an exception will be thrown and subsequent connection attempts will fail for the next five seconds, the "blocking period". If the application attempts to connect within the blocking period, the first exception will be thrown again. After the blocking period ends, another connection failure by the application will result in a blocking period that is twice as long as the previous blocking period.` – dash Sep 28 '15 at 08:09
  • (see https://msdn.microsoft.com/en-us/library/vstudio/8xx3tyca%28v=vs.100%29.aspx) – dash Sep 28 '15 at 08:10
  • By the way, Dapper seems to have support for command batching - have you tried that? It will still be worse than a bulk insert, but you'll get rid of a lot of the overhead. – Luaan Sep 28 '15 at 08:13
  • Parallel loading is used for GB-sized inputs to load data in different table partitions, to utilize the IO bandwidth of multiple disks. SqlBulkCopy itself is limited by the client computer's bandwidth. Copying the raw data to a temporary disk on the server and loading them from there avoids that bottleneck as well. – Panagiotis Kanavos Sep 28 '15 at 08:14
  • @Luaan T-SQL INSERT allows multiple values as well, so the OP could insert multiple values at once. Or (probably better), pass a table-valued parameter. BULK INSERT though ensures minimal logging - instead of logging individual operations, only the changed data pages are logged. This can be a huge performance gain – Panagiotis Kanavos Sep 28 '15 at 08:16
  • I agree with all of the comments about better ways to insert data into the database; however to answer @Pure.Krome's other question, I believe one of his tasks is probably throwing an exception; this will cause the blocking behaviour described in the article on connection pooling and will cause subsequent tasks (which will be unable to retrieve a new connection) to immediately fail. Add some exception handling and logging to your task logic to see if this is the case. – dash Sep 28 '15 at 08:17
  • I think uploading big files must be limited to an upload alone. Everything else, like bulk processing, must be handled separately. I suggest you either create a separate project that uses SqlBulkCopy (which I think you already did); just ensure that the bulk copy program is not within the web project, as it might contribute to the pool. The second option is a SQL task job that uses OpenRowSet. If you're into transforming the file into something else before the bulk, consider inserting the raw data first and transforming it afterwards. – ken lacoste Sep 28 '15 at 08:18
  • My console app is 100% separate and runs on my own machine. I can see it has its OWN connection to the Sql Server (I use my own credentials, etc) – Pure.Krome Sep 28 '15 at 08:20
  • By the way, when does the Timeout happen? During upload or during file import? – ken lacoste Sep 28 '15 at 08:21
  • Upload. There is no file import. I have 37k items in a simple `IList<..>` .. then I need to get that data into dat Sql Server. – Pure.Krome Sep 28 '15 at 08:22
  • So you're transforming / reading the uploaded file in the web instance correct? – ken lacoste Sep 28 '15 at 08:23
  • @kenlacoste nope. I've got a console app. Nothing to do with a website. I read in some data. Sprinkle :sparkles: over it. Massage it (and other data) into a nice projection. Then Insert those 37K projected data rows into Sql Server. – Pure.Krome Sep 28 '15 at 08:27
  • You might want to share your Upload code; I'm wondering how this upload makes its way even before doing the import. – ken lacoste Sep 28 '15 at 08:32
  • @kenlacoste How I get the data is totally irrelevant. It's what i'm trying to do with it, now that I have it => insert it into a db. The consensus is that I use SqlBulkInsert, which is exactly what I'm reading up about. Urgh -> I was hoping that I'd never have to work with a `DataTable` ever again ... looks like I'm going back to the 90's! – Pure.Krome Sep 28 '15 at 08:35
  • Well, you've mentioned that the Timeout happens during Upload, thus I won't suspect your import to the DB. – ken lacoste Sep 28 '15 at 08:41
  • Upload meant -- when the data is uploading to the db (i.e. the INSERTING) – Pure.Krome Sep 28 '15 at 09:07
  • Beware that errors encountered during `SqlBulkCopy` can corrupt a pooled connection causing an error on the next (completely unrelated) command that uses the same connection from the pool. – Vedran Jul 30 '18 at 10:49
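
Regarding the batching suggestions in the comments above, here is a rough sketch of what "100 rows per command" could look like with Dapper. The table name, column names and item properties (MyTable, Col1/Col2) are made up for illustration, and SQL Server's 2100-parameter limit is what caps the batch size:

// needs System.Linq, System.Text and Dapper
const int batchSize = 100;
using (var sqlConnection = new SqlConnection(_connectionString))
{
    sqlConnection.Open();
    for (var i = 0; i < items.Count; i += batchSize)
    {
        var batch = items.Skip(i).Take(batchSize).ToList();

        var values = new StringBuilder();
        var parameters = new DynamicParameters();
        for (var j = 0; j < batch.Count; j++)
        {
            if (j > 0) values.Append(",");
            values.AppendFormat("(@a{0}, @b{0})", j);   // one row of the VALUES list
            parameters.Add("a" + j, batch[j].Col1);
            parameters.Add("b" + j, batch[j].Col2);
        }

        // up to 100 rows go to the server as a single INSERT statement
        sqlConnection.Execute(
            "INSERT INTO MyTable (Col1, Col2) VALUES " + values, parameters);
    }
}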

2 Answers


First of all: if it is about performance, use SqlBulkCopy. This works with SQL Server. If you are using other database servers, they might have their own SqlBulkCopy equivalent (Oracle has one).

SqlBulkCopy works like a bulk select: one statement opens one connection and streams all the data from the server to the client. With an insert it works the other way around: it streams all the new records from the client to the server.

See: https://msdn.microsoft.com/en-us/library/ex21zs8x(v=vs.110).aspx
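
For example, a minimal sketch of feeding the 37K items through a DataTable into SqlBulkCopy - the item properties (Name, Value) and the dbo.MyTable columns here are assumptions, so map your own columns:

// needs System.Data and System.Data.SqlClient
var table = new DataTable();
table.Columns.Add("Name", typeof(string));
table.Columns.Add("Value", typeof(int));
foreach (var item in items)
{
    table.Rows.Add(item.Name, item.Value);   // one DataRow per item
}

using (var sqlConnection = new SqlConnection(_connectionString))
using (var bulkCopy = new SqlBulkCopy(sqlConnection))
{
    sqlConnection.Open();
    bulkCopy.DestinationTableName = "dbo.MyTable";
    bulkCopy.BatchSize = 5000;               // optional: rows sent per round trip
    bulkCopy.WriteToServer(table);           // streams everything over one connection
}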

If you insist on using parallelism, you might want to consider the following code:

void BulkInsert<T>(object p)
{
    // The shared enumerator arrives as 'object' so this method can be used
    // as a ParameterizedThreadStart.
    IEnumerator<T> e = (IEnumerator<T>)p;
    using (var sqlConnection = new SqlConnection(_connectionString))
    {
        while (true)
        {
            T item;
            lock (e)    // only one thread at a time may advance the shared enumerator
            {
                if (!e.MoveNext())
                    return; // no items left: dispose the connection and end this thread
                item = e.Current;
            }
            var result = sqlConnection.Execute(myQuery, new { ... } );
        }
    }
}

Now create your own threads and invoke this method on each of them with one and the same parameter: the iterator which runs through your collection. Each thread opens its own connection once, starts inserting, and after all items are inserted, the connection is closed. This solution uses as many connections as the number of threads you create.
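
For instance, the wiring could look roughly like this (the thread count of 4 and the MyItem type are placeholders):

// needs System.Threading and System.Linq
var items = GetItemsFromSomewhere();                  // the 37K items
IEnumerator<MyItem> enumerator = items.GetEnumerator();

var threads = Enumerable.Range(0, 4)
    .Select(_ => new Thread(BulkInsert<MyItem>))      // BulkInsert matches ParameterizedThreadStart
    .ToList();

threads.ForEach(t => t.Start(enumerator));            // every thread shares the same enumerator
threads.ForEach(t => t.Join());                       // wait until all items are inserted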

PS: Multiple variants of the above code are possible. You could call it from background threads, from Tasks, etc. I hope you get the point.

Martin Mulder
  • Do you have some sample code using `SqlBulkInsert`? (I'm starting to prefer that over iterating and parallel iterating.) – Pure.Krome Sep 28 '15 at 08:18
  • I have found two websites which might help: http://dotnetmentors.com/c-sharp/bulk-upload-into-sql-server-using-sqlbulkcopy-and-c-sharp.aspx and http://www.codeproject.com/Articles/18418/Transferring-Data-Using-SqlBulkCopy. Note that SqlBulkCopy relies on a data source like an IDataReader or DataTable. So, somehow, you must first put your collection into a DataTable OR create a data reader which reads from your collection. – Martin Mulder Sep 28 '15 at 08:23
  • @Pure.Krome There are a lot of examples of using [SqlBulkCopy](https://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy(v=vs.110).aspx), including the documentation itself – Panagiotis Kanavos Sep 28 '15 at 08:24
  • This code doesn't really execute the statements in parallel. It's essentially an expanded `foreach` loop using a single connection. Parallel execution would need partitioning the rows to a small number of batches, then insert each of these sets in parallel. – Panagiotis Kanavos Sep 28 '15 at 08:35
  • @PanagiotisKanavos: You are correct. I forgot the while-loop. Corrected it. Thanx! – Martin Mulder Sep 28 '15 at 08:42
  • Just used `SqlBulkInsert` - went from 1 hour down to just under 1 min. – Pure.Krome Sep 28 '15 at 10:00
  • @Pure.Krome: Good to hear! I love such performance increases! – Martin Mulder Sep 28 '15 at 11:25

You should use SqlBulkCopy instead of inserting one by one. Faster and more efficient.

https://msdn.microsoft.com/en-us/library/ex21zs8x(v=vs.110).aspx

Credits to the answer owner: Sql Bulk Copy/Insert in C#

Jeff
  • If you've found a *duplicate* question, you should mark this question as a duplicate. Copying a link to an already existing question creates clutter and makes it harder to find the correct answer the next time someone searches for the same thing. – Panagiotis Kanavos Sep 28 '15 at 08:08
  • Got it. Noted that one @Panagiotis Kanavos – Jeff Sep 28 '15 at 08:12
  • @PanagiotisKanavos: The question is not a duplicate. But one possible solution (and most likely the best solution) is the same as the answer to another, totally different question. – Martin Mulder Sep 28 '15 at 08:26
  • @MartinMulder this is really the [XY Problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) - OP has performance issues with inserts, assumes parallel operations are required and asks about a problem with parallel operations. They aren't, and even if the pool didn't crash, the server would slow to a crawl due to contention – Panagiotis Kanavos Sep 28 '15 at 08:31
  • @PanagiotisKanavos: True. But I doubt that the OP is the first one who tries to solve this problem with parallelism. I can imagine that other people will do a search on "Connection timeout" and "bulk insert" and NOT "SqlBulkCopy". This question is relevant because this way those people can find the correct answer. If you mark this question as a duplicate (and perhaps it gets removed in the future), people will not find the correct answer anymore. The XY problem page (and solution) is a good story for the OP, not an argument to close questions. – Martin Mulder Sep 28 '15 at 08:39