I have a table with millions of entries, and I'm loading data from it using a stored procedure (which applies pagination/sorting and some other things).

My problem is one of the parameters I pass in is a large list of CompanyId and EmployeeId values. These values are structured as CompanyId:EmployeeId and the string would look like this:

@companiesAndEmployees = 'CO5555:EM10001,CO5555:EM10002,CO5555:EM10003,CO7777:EM10004' --can be thousands of items long

CompanyId+EmployeeId is how I can uniquely identify the employees I want to load items from the database for. I use STRING_SPLIT() on this list to "filter" my query for just the employees I want.

SELECT *
FROM MyTable mt
    INNER JOIN STRING_SPLIT(@companiesAndEmployees, ',') ce
        ON ce.value = mt.CompanyId + ':' + mt.EmployeeId

My problem is that this list of employees can theoretically be many thousands of items long, and this query begins to choke to the point where I get this SQL timeout exception:

System.Data.Entity.Core.EntityCommandExecutionException: An error occurred while executing the command definition. See the inner exception for details.

System.Data.SqlClient.SqlException: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.

System.ComponentModel.Win32Exception:

What would be the most performant way of retrieving rows only for the users I'm asking for, when there can be many thousands of them?

Andrio
    JSON/XML or a table-valued parameter are the obvious approaches (TVPs are nice and structured but require the freedom to declare and maintain a type). "Performant" depends not only on how you get the list of items there but also how effectively it can be used in a query; there are trade-offs to be made there too (using the results directly, or in a table variable or temp table with appropriate indexes). See [here](http://www.sommarskog.se/arrays-in-sql-2005.html) for much more. – Jeroen Mostert Apr 01 '21 at 20:54
  • @JeroenMostert Thank you for the response. Getting the list itself is no problem; it's only after it's been handed to the stored procedure that I see the timeout. What's interesting is that the timeout only happens when returning a large result (> 1000 rows); a small result returns quickly. The odd thing is that the same list is passed in both scenarios. – Andrio Apr 01 '21 at 20:57
    Whether it's the best design or not is a different question, but sending the data *to* your procedure is already as performant as you can make it - serializing as JSON or a data table would just add bloat and metadata, and your string is already as compact as it can be. Your issue is more one of query performance. Don't use STRING_SPLIT inline; use it to populate a #temp table, index the temp table, and join to it in your query. – Stu Apr 01 '21 at 21:02
    If "the same" list is being passed, why wouldn't you also get "the same" results? Do you perhaps mean a list of comparable length but different values, or the same query but applied to a table with different contents? In either case it is probable that `STRING_SPLIT` is the problem, because even if splitting the list itself isn't a bottleneck, the outcome of that defies correct cardinality estimates and can cause bad execution plans, especially if (as here) you then end up joining on an expression, which prohibits efficient use of indexes. – Jeroen Mostert Apr 01 '21 at 21:03
    As your data has more than one column, a TVP would be miles better than a string to split, because you effectively need to split twice: once for the rows, then ideally again for the columns, and only then would you end up with a table you can efficiently join on. Fun if you ever need to add more columns, too. – Jeroen Mostert Apr 01 '21 at 21:05
  • @JeroenMostert Here's an example of what I meant: if I want to retrieve these items with a date range of just today, I'll pass in the huge list of employees and maybe just 5 items are returned (since the date range passed in is 1 day). This call is very fast, probably under 500ms. If instead I pass in a date range spanning years, we again pass in the same employee list, but this time thousands would get returned (or at least they should, but it times out). It's not a lot of data being loaded per row, either. I'm not familiar with TVPs but I'll look into them. – Andrio Apr 01 '21 at 21:18
  • @Stu and JeroenMostert I think you guys are right that the STRING_SPLIT is the problem. The stored proc actually has to use it twice (for reasons). I'm guessing it would be better to put it into a temp table? – Andrio Apr 01 '21 at 21:20
    Absolutely yes; the overhead of inserting into a temp table is usually insignificant compared to the savings of the optimizer producing a better plan and reduced logical reads. – Stu Apr 01 '21 at 21:22
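The temp-table approach suggested in the comments might be sketched as follows (a sketch only: the column types and index name are assumptions, and it requires SQL Server 2016+ for STRING_SPLIT). The key idea is to split the list once, store the two key parts in separate columns, and index them so the join no longer involves a computed expression:

```sql
-- Split the list once into a temp table, separating the two key parts.
-- Names and varchar lengths are illustrative.
CREATE TABLE #employees
(
    CompanyId  varchar(20) NOT NULL,
    EmployeeId varchar(20) NOT NULL
);

INSERT INTO #employees (CompanyId, EmployeeId)
SELECT LEFT(value, CHARINDEX(':', value) - 1),
       SUBSTRING(value, CHARINDEX(':', value) + 1, LEN(value))
FROM STRING_SPLIT(@companiesAndEmployees, ',');

CREATE CLUSTERED INDEX IX_employees ON #employees (CompanyId, EmployeeId);

-- Join on the real key columns instead of a concatenated expression,
-- so an index on MyTable (CompanyId, EmployeeId) can be used.
SELECT mt.*
FROM MyTable mt
    INNER JOIN #employees ce
        ON  ce.CompanyId  = mt.CompanyId
        AND ce.EmployeeId = mt.EmployeeId;
```

Because the temp table has real statistics, the optimizer can also produce far better cardinality estimates than it can for an inline STRING_SPLIT.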

1 Answer

SqlBulkCopy is an extremely efficient way to get data into SQL Server.

You can use it with a temporary table, then do a SELECT joining the temp table to MyTable.
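On the SQL side, that pattern might look like the sketch below (names are illustrative). The temp table must be created on the same open connection that SqlBulkCopy writes to, since #temp tables are scoped to the session:

```sql
-- Sketch: the client creates #ids on an open connection, bulk-copies the
-- (CompanyId, EmployeeId) pairs into it, then runs the join.
CREATE TABLE #ids
(
    CompanyId  varchar(20) NOT NULL,
    EmployeeId varchar(20) NOT NULL,
    PRIMARY KEY (CompanyId, EmployeeId)
);

-- (client side: SqlBulkCopy with DestinationTableName = "#ids" fills the table)

SELECT mt.*
FROM MyTable mt
    INNER JOIN #ids i
        ON  i.CompanyId  = mt.CompanyId
        AND i.EmployeeId = mt.EmployeeId;
```

The primary key on the pair gives the optimizer accurate statistics and an indexed join target, which is where the inline STRING_SPLIT approach falls down.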

I'm a bit unclear on whether you're applying this filter before or after the stored procedure runs. If after, you can insert the result of the stored proc into a temp table, then join the two.

If you're doing it before, and need to pass the temp table along, things get a bit sketchy but are possible: How to pass a temp table as a parameter into a separate stored procedure

gregmac