2

I have a fairly large table with a few million rows. I am trying to write efficient code that selects rows where two columns match values from a list of pairs passed in from Python code. A reasonable answer was posted.

E.g.

select *
from table
where convert(varchar(20), id1) + '-' + id2 in ('2261-7807403', '2262-9807403')

The returned table is read into a dataframe via pd.read_sql_query. I have two issues here (besides the fact that it is slow). One, id2 can be NULL, and the concatenation then yields NULL, so those rows never match. The other, more important issue is that the size of the list of pairs in the where clause can vary wildly, from one to millions.
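Roughly, my current call looks like this (a minimal sketch; the connection and the table name my_table are simplified placeholders):

import pandas as pd

def current_approach(conn, pairs):
    # Simplified sketch of what I do now; conn is an open pyodbc/SQLAlchemy
    # connection and my_table stands in for the real table.
    in_list = ",".join("'{0}-{1}'".format(i1, i2) for i1, i2 in pairs)
    sql = ("select * from my_table "
           "where convert(varchar(20), id1) + '-' + id2 in ({0})".format(in_list))
    return pd.read_sql_query(sql, conn)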

My understanding is that with a large list it is better to pull the whole columns into Python with pandas and filter them there. But how do I make the transition between a small and a large number of pairs smooth? Is there a way to do it with some clever combination of SQL Server and Python?

guyguyguy12345
  • 539
  • 3
  • 11

2 Answers

1

The where clause of

convert(varchar(20), id1) + '-' + id2 in ('2261-7807403')

is not optimal. Here we clearly expect id1 to be 2261 and id2 to be 7807403, so it is much better to have a where clause like

where (id1 = '2261' and id2 = '7807403')

You can chain several pairs together as (...) or (...) or (...) and so on. If you want rows whose id2 is NULL to match as well, you can write

where (id1 = '2261' and isnull(id2, '7807403') = '7807403')
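From Python, such a clause can be assembled from the list of pairs and sent through pd.read_sql_query with bound parameters. A rough sketch, assuming an open pyodbc connection conn and a placeholder table name my_table (both are assumptions):

import pandas as pd

def fetch_pairs(conn, pairs):
    # Sketch only: builds one "(...) or (...) or ..." where clause from the
    # (id1, id2) pairs and binds the values as parameters; a NULL id2 on a row
    # is treated as a match, mirroring the isnull() trick above.
    clauses, params = [], []
    for id1, id2 in pairs:
        clauses.append("(id1 = ? and isnull(id2, ?) = ?)")
        params.extend([id1, id2, id2])
    sql = "select * from my_table where " + " or ".join(clauses)
    return pd.read_sql_query(sql, conn, params=params)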

If you still have performance problems, then you might consider indexing these columns.
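If you do add an index, a composite index covering both columns is the natural choice; a sketch with placeholder names:

def create_pair_index(conn):
    # One-time setup (index and table names are placeholders): a composite
    # index on (id1, id2) lets SQL Server seek directly on the pair.
    cur = conn.cursor()
    cur.execute("create index ix_my_table_id1_id2 on my_table (id1, id2)")
    conn.commit()
    cur.close()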

Now, we still have to discuss a problem left unaddressed so far, which is size. You need to define what "big" means in your case. If the number of elements is big, then export the query result to a CSV file (or to multiple files, if necessary) and load it with Python. To avoid loading big data when it is not appropriate, you can run select count(*) first and decide whether the result is big or not.
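Put together, the switch could look roughly like this; the threshold is a placeholder you would tune for your data, fetch_pairs is the small-list helper sketched above, and the CSV step is skipped here in favour of reading the table straight into pandas:

import pandas as pd

PAIR_THRESHOLD = 10_000  # placeholder; define what "big" means empirically

def select_rows(conn, pairs):
    # Sketch of the size-based switch described above; my_table and conn are assumptions.
    if len(pairs) <= PAIR_THRESHOLD:
        # Small list: push the pairs into the where clause on the server side.
        return fetch_pairs(conn, pairs)
    # Big list: pull the table once and filter it in pandas instead.
    full = pd.read_sql_query("select * from my_table", conn)
    keys = pd.DataFrame(pairs, columns=["id1", "id2"])
    return full.merge(keys, on=["id1", "id2"], how="inner")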

Lajos Arpad
  • 64,414
  • 37
  • 100
  • 175
1

You should pass the values using correct types. One method is to construct a derived table and use join:

select t.*
from table t join
     (values (2261, 7807403), . . . 
     ) v(id1, id2)
     on t.id1 = v.id1 and t.id2 = v.id2;
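From Python, the values list can be generated from the pairs and the query run through pd.read_sql_query with bound parameters. A sketch, assuming an open pyodbc connection conn and a placeholder table name my_table; very long lists will hit the driver's parameter limit, so this fits the smaller case:

import pandas as pd

def fetch_with_values_join(conn, pairs):
    # Sketch of the derived-table join above; my_table and conn are assumptions.
    placeholders = ", ".join("(?, ?)" for _ in pairs)
    sql = ("select t.* "
           "from my_table t join "
           "(values {0}) v(id1, id2) "
           "on t.id1 = v.id1 and t.id2 = v.id2".format(placeholders))
    params = [value for pair in pairs for value in pair]
    return pd.read_sql_query(sql, conn, params=params)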

You should not be converting values to strings for comparison. That is not only error-prone, it can also seriously impede performance.

As for NULLs, they will not match unless you add special logic. SQL Server does not have a NULL-safe comparison operator.

select t.*
from table t join
     (values (2261, 7807403), . . . 
     ) v(id1, id2)
     on (t.id1 = v.id1 or (t.id1 is null and v.id1 is null)) and
        (t.id2 = v.id2 or (t.id2 is null and v.id2 is null));

Note that this might affect the execution plan.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • I was running this query with no issues when the number of values was small, but it gets stuck as soon as the number is more than 100. Any solutions? Might we need to somehow index the `v` table? – guyguyguy12345 Jan 28 '20 at 18:00
  • @guyguyguy12345 . . . An index on `t(id1, id2)` will work for the first query. Optimizing the second is trickier. – Gordon Linoff Jan 28 '20 at 18:47
  • How do we set up that index? Why trickier? Is `NULL` not allowed in a composite index? – guyguyguy12345 Jan 28 '20 at 19:30
  • @guyguyguy12345 . . . SQL Server will probably not use the index for the more complex comparisons needed to support `NULL`. Indexes definitely support `NULL`s; that is not the issue. SQL Server doesn't have a `NULL`-safe comparison operator and is bad at optimizing `or`s. – Gordon Linoff Jan 28 '20 at 19:37