1

I am a multithreading novice and a SQL novice, so please excuse any rookie mistakes.

I am trying to execute many SQL queries asynchronously. The queries are all select statements from the same table in the same database. I can run them synchronously and everything works fine, but testing a small subset leads me to believe that to run all the queries synchronously would take approximately 150 hours, which is far too long. As such, I'm trying to figure out how to run them in parallel.

I have tried to model the code after the answer at run a method multiple times simultaneously in c#, but my code is not executing correctly (it's erroring, though I do not know specifically how. The code just says an error occurs).

Here is what I have (A much smaller and simpler version of what I am actually doing):

class Program
{
    static void Main(string[] args)
    {
        List<string> EmployeeIDs = File.ReadAllLines(/* Filepath */);
        List<Tuple<string, string>> NamesByID = new List<Tuple<string, string>>();

        //What I do not want to do (because it takes too long) ...
        using (SqlConnection conn = new SqlConnection(/* connection string */))
        {
            foreach (string id in EmployeeIDs)
            {
                using (SqlCommand cmd = new SqlCommand("SELECT FirstName FROM Employees WITH (NOLOCK) WHERE EmployeeID = " + id, conn))
                {
                    try
                    {
                        conn.Open();
                        NamesByID.Add(new Tuple<string, string> (id, cmd.ExecuteScalar().ToString()));
                    }
                    finally
                    {
                        conn.Close();
                    }
                }
            }
        }


        //What I do want to do (but it errors) ...
        var tasks = EmployeeIDs.Select(id => Task<Tuple<string, string>>.Factory.StartNew(() => RunQuery(id))).ToArray();
        Task.WaitAll(tasks);
        NamesByID = tasks.Select(task => task.Result).ToList();
    }

    private static Tuple<string, string> RunQuery(string id)
    {
        using (SqlConnection conn = new SqlConnection(/* connection string */))
        {
            using (SqlCommand cmd = new SqlCommand("SELECT FirstName FROM Employees WITH (NOLOCK) WHERE EmployeeID = " + id, conn))
            {
                try
                {
                    conn.Open();
                    return new Tuple<string, string> (id, cmd.ExecuteScalar().ToString());
                }
                finally
                {
                    conn.Close();
                }
            }
        }
    }
}

Note: I do not care exactly how this is multithreaded (tasks, parallel.foreach, backgroundworker, etc). This is going to be used to run ~30,000 select queries exactly 1 time, so I just need it to run fast (I'm hoping for ~8 hrs = one work day, but I'll take what I can get) one time. It doesn't have to really be pretty.

Thank you in advance!

BrianH
  • 337
  • 1
  • 6
  • 12
  • 1
    Is a select all not an option then process them client side? Running this many individual queries seems a very inefficient way to do it. How many records are in the table? – Tone Sep 14 '17 at 00:21
  • you have `30,000` queries to run, is it because there are `30,000` employee ? – Squirrel Sep 14 '17 at 00:27
  • @Tone I'm not sure if a select all would work. The complication I can see is that there is a subquery in the real version. (The real version has a subquery to get what is "id" in my example) So the real query looks like "Select FristName from table where Employee ID = (Select top 1 EmployeeID from otherTable where variable = value). Would that prevent selecting all? (Like I said, SQL novice.) To your second point, each table (query and subquery) have hundreds of thousands of rows and the query table has 200+ columns. – BrianH Sep 14 '17 at 00:27
  • @Squirrel That's an example that shows what I'm going for. In reality I'm getting info on 30,000 transactions, and yes, I've narrowed the hundreds of thousands of transactions in the database to these 30,000 of interest. – BrianH Sep 14 '17 at 00:30
  • A trick I've used in the past... Use a SQL script to create a separate Agent job for each execution. The same script kicks off each job as it's created and the final step of each job is to delete itself upon sussessful completion of the query step. – Jason A. Long Sep 14 '17 at 00:36
  • @BrianH I'd still expect a select all to be a better option than so many queries (select only the fields you need - e.g. ID, FirstName). Not really familiar with subqueries but I'd expect it to work. `WHERE IN` may be a better option but if you run into problems with performance there, another alternative is to send over a table of IDs to use in a single query. e.g. something like this answer [Selecting multiple rows by ID, is there a faster way than WHERE IN](https://stackoverflow.com/a/17439466/1364358). That's not something I have enough knowledge about but somebody else may have input. – Tone Sep 14 '17 at 00:46
  • If your code takes 150 hours when run synchronously and you manage to get a perfect speed improvement running them on a 8 core CPU then your queries will still take about 19 hours. In reality though multi-threading only gets the maximum improvement with in-memory operations. In fact SQL operations might **actually be slower** with multi-threading. You need to figure out a better query and not try to force a bad one to work faster. – Enigmativity Sep 14 '17 at 02:23

3 Answers3

3

This is just plain wrong. You should build one query to select all FirstNames you need. If you need to pass a bunch of ids to the server, that is no problem, just use table valued parameter (aka TVP), coma separated list of values really does not scale well. If the query is correctly written and the tables indexed, that should be quite fast. 100k rows table is a small table.

The query then may look like this

SELECT DollarAmount, comp.CompanyID
FROM Transactions 
JOIN (SELECT MIN(TransactionID) as minTransactionID, CompanyID 
      FROM CompanyTransactions       
      GROUP BY CompanyID
     ) AS comp 
ON Transactions.TransactionID = comp.minTransactionID
JOIN @IDList ON id = comp.CompanyID

You may use IN instead of JOIN if the ids in TVP are not unique.

Btw. do you know what NOLOCK means? If you are the only user of the database and use it single threaded or do not modify any data, then you are safe. Other than that it means that you are okay with a small chance of:

  • some records may be missing in the result
  • there are duplicate records in the result
  • there are rows in the result, that have never been committed and never were accepted as valid data
  • if you use varchar(max), you may get text that has never been stored
Antonín Lejsek
  • 6,003
  • 2
  • 16
  • 18
  • Could you give me an example of how I would do that? (SQL novice) I'm not sure what you mean by "Table Valued Parameter". I could send it as a comma separated list (I know what that is), but passing in 30,000 IDs in a comma list seems ... Excessive. Thanks. Sorry for my lack of knowledge. – BrianH Sep 14 '17 at 00:56
  • I added link to explanation. TVP has one thing to know about, you have to create table type on the server. But it really pays back and it is really about two lines of code. You may start with stored procedure to grasp the principle, but TVP is not limited to that, you may use this parameter in any other query. – Antonín Lejsek Sep 14 '17 at 01:16
1

You want to do one query to get all of the ID/Name combinations, then put them into a dictionary (for quick access). This will remove the very slow process of running 30,000 queries as well as reduce the complexity of your code.

I could get you something more concrete if you posted the actual SQL query (you can change the column and table names if you need) but this should be close:

;WITH CompTransCTE AS (
    SELECT CompanyID, MIN(TransactionID) AS TransactionID
    FROM CompanyTransactions
    WHERE CompanyID IN (/*Comma seperated list of values*/)
    GROUP BY CompanyID
)
SELECT CT.CompanyID, T.DollarAmount, T.TransactionID
FROM Transactions AS T
INNER JOIN CompTransCTE AS CT ON CT.TransactionID = T.TransactionID;
Trisped
  • 5,705
  • 2
  • 45
  • 58
  • The actual query (with names changed because to be honest I'm not sure if I'm allowed to post it or not so I'll err on the side of caution) is "SELECT DollarAmount FROM Transactions WHERE TransactionID = (SELECT TOP 1 TransactionID FROM CompanyTransactions WHERE CompanyID = @CompanyID ORDER BY TransactionID ASC) and my file has all the CompanyIDs that are fed into the query – BrianH Sep 14 '17 at 00:45
  • @BrianH I have updated the query. Let me know if you have issues. I will be off line for a few hours, but I will check when I get back on. – Trisped Sep 14 '17 at 00:58
  • I'll have to get back to you tomorrow, as I'll test it at work tomorrow. I'll let you know then. I really appreciate the help. I do have 1 follow up question: I have 30,000 IDs. Will SQL handle a comma separated list with 30,000 entries? And if so would it really be faster, or is this more about the right way of doing things vs getting the time down? – BrianH Sep 14 '17 at 01:01
  • @BrianH Not sure if it will handle 30,000 in a comma separated list. If not, put them in a table first and replace the list with `(SELECT CompanyID FROM #Table)`. It will definitely be faster. SQL is great with large data, but it sucks with large numbers of queries. – Trisped Sep 14 '17 at 01:04
1

Without creating a User-Defined Table Type in the database, you can use SqlBulkCopy to load the IDs into a temp table, and reference that in the query.

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Linq;

namespace ConsoleApp11
{
    class Program
    {

        static void Main(string[] args)
        {
            //var EmployeeIDs = File.ReadAllLines(""/* Filepath */);
            var EmployeeIDs = Enumerable.Range(1, 30 * 1000).ToList();
            var dt = new DataTable();
            dt.Columns.Add("id", typeof(int));

            dt.BeginLoadData();
            foreach (var id in EmployeeIDs)
            {
                var row = dt.NewRow();
                row[0] = id;
                dt.Rows.Add(row);
            }
            dt.EndLoadData();

            using (SqlConnection conn = new SqlConnection("server=.;database=tempdb;integrated security=true"))
            {
                conn.Open();

                var cmdCreateTemptable = new SqlCommand("create table #ids(id int primary key)",conn);
                cmdCreateTemptable.ExecuteNonQuery();

                //var cmdCreateEmpable = new SqlCommand("create table Employees(EmployeeId int primary key, FirstName varchar(2000))", conn);
                //cmdCreateEmpable.ExecuteNonQuery();


                var bc = new SqlBulkCopy(conn);
                bc.DestinationTableName = "#ids";
                bc.ColumnMappings.Add("id", "id");
                bc.WriteToServer(dt);

                var names = new List<string>();
                var cmd = new SqlCommand("SELECT FirstName, EmployeeId FROM Employees WHERE EmployeeID in (select id from #ids)", conn);
                using (var rdr = cmd.ExecuteReader())
                {
                    var firstName = rdr.GetString(0);
                    var id = rdr.GetInt32(1);
                    names.Add(firstName);
                }

                Console.WriteLine("Hit any key to continue");
                Console.ReadKey();
            }



        }


    }
}
David Browne - Microsoft
  • 80,331
  • 6
  • 39
  • 67