2

I have a SQL Server stored procedure (SP) and would like to execute it for around 10 million user records, processing the user info and updating the DB. Currently it takes a few milliseconds to update one user in the DB. Since it has to process millions of records, what are the best ways to do it?

I am thinking of doing it from the C# application using multithreading or a parallel foreach. Below is the basic code to run the SP; how can I use it to run it in multithreaded mode to do the processing job faster?

    string connString = @"data source=.\SQLEXPRESS;initial catalog=Test;integrated security=True;";
    SqlConnection conn = new SqlConnection(connString);
    conn.Open();
    SqlCommand cmd = new SqlCommand("ETL_TEST", conn);
    cmd.CommandType = CommandType.StoredProcedure;

    cmd.Parameters.Add(new SqlParameter("@user_id", "12345"));
    cmd.ExecuteReader();
    conn.Close();
rajcool111
  • multithreading this will end up being a LOT slower. You would effectively have to change this to be row by row. Or is this process already doing this row by row? – Sean Lange Feb 17 '16 at 21:58
  • there is one more SP written to do row-by-row operations and it takes a lot of time. – rajcool111 Feb 17 '16 at 22:00
  • @SeanLange: No it won't be a lot slower. Up to a point, multiple concurrent requests to SQL Server speed up overall throughput. – Eric J. Feb 17 '16 at 22:00
  • What user information (i.e., which columns and data types) are you trying to update? Also, where is the update information coming from? – nobody Feb 18 '16 at 00:36

3 Answers

3

how can I use it to run it in multithreaded mode to do the processing job faster?

Parallel.ForEach() is a very reasonable starting point to accomplish your goal. Up to a point, sending concurrent requests to SQL Server can improve overall throughput.

Put your code into the action of Parallel.ForEach(). Use using statements to ensure that things are properly disposed (right now you will leave a connection open if an exception is thrown before it is closed).
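
For illustration, here is a minimal sketch of that idea, reusing the connection string, procedure name, and @user_id parameter from the question. The userIds collection is a placeholder for however you enumerate the 10 million IDs:

    // Sketch only. Requires: using System.Data; using System.Data.SqlClient; using System.Threading.Tasks;
    // userIds is assumed to be an IEnumerable<string> of the user IDs to process.
    Parallel.ForEach(userIds, userId =>
    {
        using (var conn = new SqlConnection(@"data source=.\SQLEXPRESS;initial catalog=Test;integrated security=True;"))
        using (var cmd = new SqlCommand("ETL_TEST", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.Parameters.Add(new SqlParameter("@user_id", userId));

            conn.Open();
            cmd.ExecuteNonQuery();   // ExecuteNonQuery is enough here since the SP just updates rows
        }
    });

Opening a connection inside each iteration is cheap in practice because ADO.NET pools connections behind the scenes.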

Note that the .NET Framework will attempt to find an optimal number of threads (and different versions of the framework may end up selecting different numbers). It is possible that .NET will select a number of threads that is either

  • Too low (which you cannot influence, but you could solve by creating and managing tasks explicitly)
  • Too high, resulting in inefficient execution in SQL Server (due to IO or lock contention, high CPU, etc.). In this case, you can use the MaxDegreeOfParallelism property to limit the number of threads (a sketch follows this list). Don't forget to adjust the property if you move to a more or less powerful SQL Server.
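
For example, a small sketch of capping the degree of parallelism (the value 8 below is just a placeholder to tune for your server):

    // Sketch only: 8 is an arbitrary starting point; measure and adjust for your SQL Server.
    var options = new ParallelOptions { MaxDegreeOfParallelism = 8 };

    Parallel.ForEach(userIds, options, userId =>
    {
        // ... same per-user stored procedure call as in the sketch above ...
    });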

Note that a multi-threaded solution, while probably more efficient than a single-threaded one, is still very chatty. It makes a request to the database for each user record. In this type of situation, I have achieved order-of-magnitude performance improvements by sending a batch of work over to SQL Server rather than a single work item. You'll likely see the greatest gains by redesigning the stored procedure to accept a number of users at once.
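
As a rough illustration of that batching idea, suppose the procedure were redesigned to accept a table-valued parameter. The dbo.UserIdList table type and the ETL_TEST_BATCH procedure below are hypothetical names, not something defined in the question; the C# side could then send a few thousand IDs per call:

    // Sketch only. Assumes a user-defined table type dbo.UserIdList (a single user_id column)
    // and a procedure ETL_TEST_BATCH that accepts it -- both hypothetical.
    var table = new DataTable();
    table.Columns.Add("user_id", typeof(string));
    foreach (var id in userIdBatch)                  // userIdBatch: a chunk of a few thousand IDs
        table.Rows.Add(id);

    using (var conn = new SqlConnection(connString)) // connString as in the question
    using (var cmd = new SqlCommand("ETL_TEST_BATCH", conn))
    {
        cmd.CommandType = CommandType.StoredProcedure;

        var p = new SqlParameter("@user_ids", SqlDbType.Structured);
        p.TypeName = "dbo.UserIdList";
        p.Value = table;
        cmd.Parameters.Add(p);

        conn.Open();
        cmd.ExecuteNonQuery();
    }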

Eric J.
  • It is very important to stress that the TPL will generally be completely off from the optimal number of threads for IO work. The number of threads will (literally) trend to infinity over time. This is easy to reproduce with a `sleep(100)`. +1 because you said that but I wanted to stress that even more. – usr Feb 17 '16 at 22:32
  • @usr: In my experience TPL does a reasonable but not perfect job of selecting the number of threads. In doing the work related to the linked question, I did have to specifically limit the degree of parallelism. Can you cite an example for when TPL will tend toward a very large number of threads? Would be curious to understand that situation. – Eric J. Feb 17 '16 at 23:00
  • https://github.com/dotnet/coreclr/issues/1754 Really bad. Unlimited thread creation in all realistic IO scenarios. – usr Feb 17 '16 at 23:19
  • That link complains that core is too *slow* in creating new threads when the workload profile changes drastically. It doesn't mention that an arbitrary number of threads would ever be created. – Eric J. Feb 17 '16 at 23:44
  • Quoting: `The memory use is proportional to the number of threads (here: 731).`. Both problems are present. Slow increase but infinite increase. That picture is scary! Linear growth of thread count over time. Extrapolate that over night. This is the kind of problem that takes production systems down. – usr Feb 18 '16 at 00:03
  • These threads are created on the thread pool, which imposes limits that have not been hit in the example http://stackoverflow.com/a/145323/141172. The number of threads that can be created in that scenario is further limited by the configured max requests https://msdn.microsoft.com/en-us/library/dd560842(v=vs.110).aspx. You can limit the maximum number of thread pool threads https://msdn.microsoft.com/en-us/library/system.threading.threadpool.setmaxthreads(v=vs.110).aspx, though I doubt many developers do that until they run into problems. Not saying it's not a problem but it can be managed. – Eric J. Feb 18 '16 at 00:59
  • It absolutely can be managed but it also means that a simple loop such as `Parallel.ForEach(fileNames, fn => WriteFile(fn))` will lead to infinite amounts of threads and also completely monopolize the thread-pool in the process. – usr Feb 18 '16 at 06:14
  • @Eric Yes, Parallel.ForEach did do a great job of making it faster from my C# app, though I did have to add the MaxDegreeOfParallelism property. – rajcool111 Feb 18 '16 at 07:11
0

Since it has to process millions of records, what are the best ways to do it?

If you are asking what is the best way to make the process parallel, then @Eric's answer looks promising. I would agree that, generally speaking, "Up to a point, sending concurrent requests to SQL Server can improve overall throughput," as long as we really stress the "Up to a point" part. But "that point" could easily be just a few requests running at a time (depending on the logic and schema), which isn't the type of gain you are looking for.

If you are asking what is the best way to accomplish the processing of 10 million "sets" of records as quickly and efficiently as possible, well, that is a much more involved question. It requires a bit more analysis. At a bare minimum it would require reviewing both the logic in your ETL_TEST stored procedure and your schema (tables, indexes, etc.). So for the moment, there is not enough information presented here to provide any meaningful assistance along these lines.

Solomon Rutzky
0

There are several ways you could speed this up:

  1. Passing the data in a table-valued parameter so you can process them all in one call is likely the best way, but it does take a bit of setup, and if this is a one-time thing it's not really worth it, IMO.
  2. Since you are dealing with a single parameter, turn it into a comma-separated string of IDs and pass it in so you can process a couple of thousand per call on the server.
  3. The simplest thing you can do without changing much code is to put this in a transaction and commit every couple of thousand records, or all at once. Doing this will increase the speed by about 100-fold; a rough sketch follows this list.
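
As a rough sketch of option 3 (the userIds collection, the batch size of 2,000, and the reuse of connString from the question are assumptions for illustration):

    // Sketch only: one connection, explicit transactions committed every 2,000 calls.
    using (var conn = new SqlConnection(connString))
    {
        conn.Open();
        int count = 0;
        SqlTransaction tx = conn.BeginTransaction();

        foreach (var userId in userIds)
        {
            using (var cmd = new SqlCommand("ETL_TEST", conn, tx))
            {
                cmd.CommandType = CommandType.StoredProcedure;
                cmd.Parameters.Add(new SqlParameter("@user_id", userId));
                cmd.ExecuteNonQuery();
            }

            if (++count % 2000 == 0)
            {
                tx.Commit();
                tx = conn.BeginTransaction();   // start the next batch
            }
        }

        tx.Commit();   // commit whatever is left in the final batch
    }
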
Steve