Slow query with many rows (simple query with no joins)

Question

I have a query that usually takes over five or six minutes to run. It has a WHERE and ORDER BY clause but no JOINs. But if I simplify the query to just this:

SELECT * FROM ReportIndex

It still takes over five or six minutes to run.

The table has over eleven million rows.

Since there's no join, I can't see how an index would make it faster. Can anyone suggest anything else I might try to speed up this query?

UPDATE:

This is the actual query I'm using. It returns 6,668,324 rows.

SELECT '1' AS vrsID,
    lprKey,
    lprRptName,
    lprTitle,
    lprDate,
    lprOwner,
    lprUserID,
    lprArchiveDate,
    lprTrackDate,
    lprActionView,
    lprActionEmail,
    lprActionExcel,
    lprActionForward,
    lprActionReassign,
    lprActionDownload,
    lprActionLocalPrint,
    lprActionServerPrint,
    lprPageCount,
    lprBytes,
    lprDataType,
    lprArchived,
    lprJobName,
    lprViewed,
    lprRptID
FROM ReportIndex
WHERE (lprOwner IN ('admin', 'APAdmin', 'APClerk', 'AP-Tab-700', 'AP-Tab-A-B', 'AP-Tab-A-K_EMP', 'AP-Tab-AP520', 'AP-Tab-CAN', 'AP-Tab-C-E', 'AP-Tab-COMM', 'AP-Tab-Confidential', 'AP-Tab-EE-Waiting', 'AP-Tab-F-O', 'AP-Tab-Historical', 'AP-Tab-LCD', 'AP-Tab-LCD_EMP', 'AP-Tab-LEA', 'AP-Tab-LPP', 'AP-Tab-LPS', 'AP-Tab-LPS_EMP', 'AP-Tab-LSI', 'AP-Tab-LTI', 'AP-Tab-L-Z_EMP', 'AP-Tab-P-R', 'AP-Tab-S-Z', 'AP-Tab-Unknown', 'Group-Category-VendorDocuments', 'Group-Quality-Control', 'Group-VendorDocs', 'HRAdmin', 'HR-Category-Payroll', 'HR-Category-Performance', 'HR-Category-Personnel', 'HR-Category-Upload', 'HR-Document-Delete', 'HR-Document-Index', 'HR-Group-DocumentMaintenance', 'HR-IndexQueue-Email', 'HRROLE01', 'HRROLE02', 'HRROLE03', 'HRROLE04', 'HRROLE05', 'HRROLE06', 'HRROLE09', 'HRROLE10', 'HRROLE11', 'HRROLE12', 'HRROLE13', 'HRROLE14', 'HRROLE15', 'HRROLE16', 'HRROLE17', 'HRROLE18', 'HRROLE19', 'HRROLE21', 'HRROLE23', 'HRROLE24', 'HRROLE25', 'HRROLE26', 'HRROLE28', 'HRROLE29', 'HRROLE30', 'HRROLE31', 'HRROLE34', 'HRROLE35', 'HRROLE36', 'HRROLE37', 'HRROLE39', 'HRROLE41', 'HRROLE42', 'HRROLE43', 'HRROLE44', 'HRROLE45', 'HRROLE46', 'HRROLE47', 'HRROLE48', 'HRROLE49', 'HRROLE50', 'HRROLE51', 'HRROLE52', 'HRROLE53', 'HRROLE54', 'HRROLE55', 'HRROLE56', 'HRROLE57', 'HRROLE58', 'HRROLE59', 'HRROLE60', 'HRROLE61', 'HRROLE62', 'HRROLE63', 'WFAdmin', 'AccountsPayable')
AND lprArchived = 0 AND lprPendingDelete = 0) AND lprDone=0
ORDER BY lprDate DESC

UPDATE 2:

Here's the execution plan: https://www.brentozar.com/pastetheplan/?id=Skjuu8sLX

Do you really need to `SELECT` all the columns and all rows? Using the `WHERE` clause - how many rows are returned as a result? It might be possible to create index to optimize the search, but first tell if you need all the columns. — gotqn, Aug 22 '18 at 05:27
@gotqn: No. I just wanted to simplify the query to demonstrate that the `WHERE` or `ORDER BY` clauses weren't to blame. The query I'm actually using filters the rows and names the columns I need, but is still too slow. — Jonathan Wood, Aug 22 '18 at 05:28
Please add something we could work with besides a simple `select all`. And add the execution plan of the query and the table structure. — juergen d, Aug 22 '18 at 05:31
If there is nothing private in the query, you can post it. I need to see what are selected columns and how its search. Also, your primary key definition, if such exists. — gotqn, Aug 22 '18 at 05:31
Can you add the where and order by clauses back in and post the execution plan? — Ola Ekdahl, Aug 22 '18 at 05:31
Do you have an index that includes lprOwner, lprArchived, lprPendingDate, lprDone and lprDate? — user5151179, Aug 22 '18 at 05:38
@gotqn: Actually, it appears there is no primary key. (Note that I did not design this database.) I can play with that. Would that really make it slower? — Jonathan Wood, Aug 22 '18 at 05:40
When your query returns 6 million records then your indexes won't be much of a use! — juergen d, Aug 22 '18 at 05:40
@juergend: Yes, I know. I didn't write this software. I've been asked to try and improve it. But it seems like it should be faster than it is. — Jonathan Wood, Aug 22 '18 at 05:41
@JonathanWood How many rows the given query returns? Also, we need to create `PK`, too (before creating other indexes). Do you have a column(s) good candidate for primary key? — gotqn, Aug 22 '18 at 05:46
@gotqn: The filtered query is returning 6,668,324 rows. Obviously, that's part of the problem. I don't think I have a good candidate column for the PK. I think that's part of the problem. The person that designed the schema wasn't very good. I'm trying to get the execution plan now. — Jonathan Wood, Aug 22 '18 at 05:49
Then, let's just create a simple identity column - https://stackoverflow.com/a/4862427/1080354. After that, we can try to create a filtered nonclustered index. Can you create such PK? — gotqn, Aug 22 '18 at 05:53
@OlaEkdahl: The execution plan is too big. Stackoverflow won't let me add the XML to the question. — Jonathan Wood, Aug 22 '18 at 05:57
@gotqn: Doing that now. But the execution plan data says 23% is due to lack of index, and 74% is taken by the sort, which is not on an indexed column. I think those two issues along with the fact that it returns way more rows that we could possibly need is the answer. — Jonathan Wood, Aug 22 '18 at 06:04
I think a use-case might be in order here. What happens to the 6M rows? Presumably, you don't display them on a screen, or in a report. If you need to get half of your 11M rows then it might just take time. Id also test the query replacing the big IN clause with a join to a selected_owner table, but I doubt that would help. — TomC, Aug 22 '18 at 06:54
_'The execution plan is too big'_ - is `ReportIndex` a table or a view? — Ivan Starostin, Aug 22 '18 at 07:58
Jonathan Wood Can you confirm that ReportIndex is a table and not a view (as per @Ivan Starostin comment above) please? — MJH, Aug 22 '18 at 10:13
How the execution plan of plain select can be big? Upload it here then and let's take a look: https://www.brentozar.com/pastetheplan/ — Ivan Starostin, Aug 22 '18 at 19:41
@IvanStarostin: I don't know. I had it generate the execution plan when I ran the query. I could find no way to copy that information so I exported it to XML. The XML was big and, in fact, when I attempted to paste it into my question, I got an error saying it exceeded the maximum question size. — Jonathan Wood, Aug 22 '18 at 20:57
"here" - I meant "brentozar" URL in the end of my comment. And share here (in the question) link to it. @JonathanWood — Ivan Starostin, Aug 22 '18 at 21:00
74% of your execution plan is for the sort operator (as you have already stated in another comment), you can avoid this by creating a clustered index on lprDate (descending), this may take some time to complete, but should avoid the sort as the data is in the same order as required by your query. — MJH, Aug 23 '18 at 07:49
How long does the query take if you include the WHERE clause but remove the ORDER BY? — MJH, Aug 23 '18 at 08:27
@JonathanWood penultimate paragraph from Gordon's answer. Same as MJH advice. There is not much you can do unless your query becomes much more selective. — Ivan Starostin, Aug 23 '18 at 09:15

score 1 · Answer 1 · answered Aug 22 '18 at 11:22

1

This is too long for a comment.

Your query is essentially:

SELECT . . .
FROM ReportIndex
WHERE lprOwner IN (<big list>) AND
      lprArchived = 0 AND
      lprPendingDelete = 0 AND
      lprDone = 0
ORDER BY lprDate DESC

The standard recommendation is an index on something like ReportIndex(lprArchived, lprPendingDelete, lprDone, lprOwner, lprDate).

However, your query is selecting 6.6 million rows out of 12 million. An index won't help, because the SELECT is not sufficiently selective. If you were selecting 6.6 thousand rows, then an index might be a big help.

There is little you can do. If this is really important, you can create a clustered index on (lprDate). This will take a while to create -- all the data has to be sorted on the data pages.

The query should then proceed by reading the data in order and applying the WHERE conditions.

answered Aug 22 '18 at 11:22

Gordon Linoff

1,242,037
58
646
786

I'm still testing but, as mentioned in the question, it seemed like the performance was really slow even without the `WHERE` clause. – Jonathan Wood Aug 22 '18 at 22:18
1

The query is slow for two reasons: 1. You are selecting a lot of data (Table Scan 23% in your execution plan). and 2. The Sort operator (74% of your execution plan). There's nothing you can do about the Table Scan while you are selecting so many rows, but you can do something about that Sort operator by creating a clustered index on the lprData column. – MJH Aug 23 '18 at 07:53
@MJH: Could you explain a little bit about why it should be a clustered index vs an unclustered index? – Jonathan Wood Aug 23 '18 at 14:02
If you create a non-clusered index it would have to contain all the columns referenced in your query (order by columns, where clause columns, join columns and select list) otherwise the plan will have to do a "lookup" against the data for each row. Putting a clustered index on your sort column (lprDate DESC) will cause the data on the disk to be stored in that order, so the read operation will read the data from disk in the order required by the sort, which eliminates the need for a sort operator in your plan. – MJH Aug 23 '18 at 14:09
@MJH: Thanks. Can you tell me if the fact that I don't have a primary key has any impact on these indexes? – Jonathan Wood Aug 23 '18 at 14:45
1

A Primary Key would make minimal difference to performance in this case. EDIT: Although, I should say that it is good practice for all tables to have a PK. – MJH Aug 23 '18 at 14:49
@MJH: Yes, I always give tables I design a primary key, but looks like that's not the case with everyone. Thanks for your insights. – Jonathan Wood Aug 23 '18 at 15:01
@MJH: I'm sorry but could I ask one more question? It appears the user can specify different columns to sort by. I know that you can only have one clustered index on a table. And, assuming multiple columns in a clustered index is a different sort from any one of those columns (except the first), would it make sense to put non-clustered indexes on the columns they can sort by? – Jonathan Wood Aug 24 '18 at 17:20
@JonathanWood . . . I don't think so. I think the `IN` ends up precluding the use of the index. – Gordon Linoff Aug 24 '18 at 17:59
@GordonLinoff: My question was actually to MJH. I'm not really doing anything about the IN` clause as removing that line doesn't appear to make any noticeable difference. – Jonathan Wood Aug 24 '18 at 18:00
Yeah, if the users can change which column the query uses to sort on then a clustered index would not help for each situation. – MJH Aug 24 '18 at 18:35

score 0 · Answer 2 · answered Aug 22 '18 at 08:15

Efficiency wise try to avoid using the IN clause. If you have a large list of values inside the IN it can be very slow. To optimise its usually far bett er to added these values to a TVP and then join to it. You can also then add indexing to make the join more efficient. give the following a try:

DECLARE @tvp_lprOwner AS TABLE (owner NVARCHAR(255) NOT NULL);

INSERT INTO @tvp_lprOwner (owner)
VALUES ('admin') , ('APAdmin') , ('APClerk') , ('AP-Tab-700') , ('AP-Tab-A-B') , ('AP-Tab-A-K_EMP') , ('AP-Tab-AP520') , ('AP-Tab-CAN') , ('AP-Tab-C-E') , ('AP-Tab-COMM') , ('AP-Tab-Confidential') , 
('AP-Tab-EE-Waiting') , ('AP-Tab-F-O') , ('AP-Tab-Historical') , ('AP-Tab-LCD') , ('AP-Tab-LCD_EMP') , ('AP-Tab-LEA') , ('AP-Tab-LPP') , ('AP-Tab-LPS') , ('AP-Tab-LPS_EMP') , ('AP-Tab-LSI') , 
('AP-Tab-LTI') , ('AP-Tab-L-Z_EMP') , ('AP-Tab-P-R') , ('AP-Tab-S-Z') , ('AP-Tab-Unknown') , ('Group-Category-VendorDocuments') , ('Group-Quality-Control') , ('Group-VendorDocs') , ('HRAdmin') , 
('HR-Category-Payroll') , ('HR-Category-Performance') , ('HR-Category-Personnel') , ('HR-Category-Upload') , ('HR-Document-Delete') , ('HR-Document-Index') , ('HR-Group-DocumentMaintenance') , 
('HR-IndexQueue-Email') , ('HRROLE01') , ('HRROLE02') , ('HRROLE03') , ('HRROLE04') , ('HRROLE05') , ('HRROLE06') , ('HRROLE09') , ('HRROLE10') , ('HRROLE11') , ('HRROLE12') , ('HRROLE13') , 
('HRROLE14') , ('HRROLE15') , ('HRROLE16') , ('HRROLE17') , ('HRROLE18') , ('HRROLE19') , ('HRROLE21') , ('HRROLE23') , ('HRROLE24') , ('HRROLE25') , ('HRROLE26') , ('HRROLE28') , ('HRROLE29') , 
('HRROLE30') , ('HRROLE31') , ('HRROLE34') , ('HRROLE35') , ('HRROLE36') , ('HRROLE37') , ('HRROLE39') , ('HRROLE41') , ('HRROLE42') , ('HRROLE43') , ('HRROLE44') , ('HRROLE45') , ('HRROLE46') , 
('HRROLE47') , ('HRROLE48') , ('HRROLE49') , ('HRROLE50') , ('HRROLE51') , ('HRROLE52') , ('HRROLE53') , ('HRROLE54') , ('HRROLE55') , ('HRROLE56') , ('HRROLE57') , ('HRROLE58') , ('HRROLE59') , 
('HRROLE60') , ('HRROLE61') , ('HRROLE62') , ('HRROLE63') , ('WFAdmin') , ('AccountsPayable');

SELECT  '1' AS vrsID
,       RI.lprKey
,       RI.lprRptName
,       RI.lprTitle
,       RI.lprDate
,       RI.lprOwner
,       RI.lprUserID
,       RI.lprArchiveDate
,       RI.lprTrackDate
,       RI.lprActionView
,       RI.lprActionEmail
,       RI.lprActionExcel
,       RI.lprActionForward
,       RI.lprActionReassign
,       RI.lprActionDownload
,       RI.lprActionLocalPrint
,       RI.lprActionServerPrint
,       RI.lprPageCount
,       RI.lprBytes
,       RI.lprDataType
,       RI.lprArchived
,       RI.lprJobName
,       RI.lprViewed
,       RI.lprRptID
  FROM  ReportIndex RI
  JOIN @tvp_lprOwner O ON RI.lprOwner = O.owner
 WHERE  (lprArchived = 0 AND lprPendingDelete = 0)
   AND  lprDone = 0
 ORDER BY lprDate DESC;

Worth running it, then look at the execution plan, and index from there if need be.

Yes, this was my suggestion in an earlier comment. I suspect it won't help (apart from readability) but definitely worth testing — TomC, Aug 22 '18 at 08:43
Sorry Tom. Didn't see the comment - it was hidden in the "more comments" bit. You're right though TVP is the way to go for this. Without going into to much detail on the indexes - the execution plan should make a suggestion on that side of things based on the actual database stats. Provided its sensible I'd go with that. — Matthew Baker, Aug 22 '18 at 08:55
Interesting to see @mathew_baker. Sometimes these things behave very differently with a #temp table too. Always worth testing both. — TomC, Aug 22 '18 at 11:47
I initially posted only the simplified version of the query (without the `IN` clause) because that was slow on its own. I included my actual query because I had several comments saying I should. So while I appreciate the suggestion here, I'm not sure it addresses my issue because I really couldn't notice any performance improvement when I remove the `WHERE` clause completely. — Jonathan Wood, Aug 24 '18 at 18:03

user5151179 · Answer 3 · 2018-08-22T05:52:38.927

-1

Try creating a new index:

CREATE INDEX [IX_Composite] ON [ReportIndex] ([lprOwner], [lprArchived], [lprPendingDate], [lprDone], [lprDate]);

These 5 fields are included because 4 of them appear in your WHERE clause, and the last one is used in your ORDER BY.

Depending on your use case, it might be worth looking at paging the results. This can help where transferring the data in one big chunk is at the core of your issue.

edited Aug 22 '18 at 05:52

answered Aug 22 '18 at 05:44

user5151179

575
2
10

But it doesn't sort on those columns. So I'm not understanding why an index on them would make a difference. – Jonathan Wood Aug 22 '18 at 06:10
Having an index covering the columns in your WHERE clause will make the query run faster. Try creating the index and see if it helps.. It may get you over the line. – user5151179 Aug 22 '18 at 06:16
What about the columns in the SELECT list? Do you think SQL Server will choose a plan with 6 Million lookups? – MJH Aug 22 '18 at 10:12

score -1 · Answer 4 · answered Aug 22 '18 at 11:11

From your responses on the comments section.

you mentioned there are no indexes on the table, and also, this might not be the only issue, hardware also would play a huge part on this as well.

So, what you need to do is first check the hardware specs, if is it outdated or can't handle that much of data processing, then stop there, and advise them to upgrade with a proper hardware. If they have a good hardware, then you can proceed to other SQL Server suggestions.

First, data order : I see in your query you're ordering the data by lprDate in descending order, this would be an expensive part of the query especially on large scale. Practical example is to use the same query without a where clause, just use ORDER BY lprDate DESC, and then compare it with the same query without it (use TOP 10000 to test). You'll see a performance difference between the two queries. If you always select the recent data from the table (which most people do), then you need to create a clustered index with a Descending order. This would resort all the data inside the table and also the new data will be always at the top page. The only issue that you might experience is resorting a 11M rows would take too much time to process, and sometimes the process fails. So, what I actually do (to avoid that) is recreating the same table with different name. Then create a new clustered index with the proper sort, along with the proper indexes that needed. Then when I'm done, I just reinsert the records by batches. You can do the same method, you can use lprDate or create an IDENTITY (or use the current if any) to make them clustered and adjust the sort to DESC, add then the proper indexes. After that, reinsert the data (in batches) into the new table, and make sure you specify the order by in your batches.

You can use this method of reinserting the data by batches :

DECLARE @Count  INT 
SET @Count = 1
WHILE @Count > 0
BEGIN
    INSERT INTO NewTable(Columns)
    SELECT 10000 Columns
    FROM 
        OriginalTable
    ORDER BY ID 

    SET @Count = @@ROWCOUNT
END

it'll insert 10 thousend rows in each run. you could increase or decrease the number of rows as you wish, whatever works best with you.

Now you could test the query in the new table, see how it would perform.

You might need to create an index for the actual query (in the post) with reording the WHERE clause as well.

So, your WHERE clause will be like this :

WHERE 
    lprArchived = 0 
AND     lprPendingDelete = 0
AND     lprDone=0
AND     lprOwner IN (.....)

AND your index will be :

CREATE INDEX [XReport] ON (lprArchived, lprPendingDelete, lprDone, lprOwner) INCLUDE (lprKey,  lprRptName,  lprTitle,  lprDate,  lprOwner,  lprUserID,  lprArchiveDate,  lprTrackDate,  lprActionView,  lprActionEmail,  lprActionExcel,  lprActionForward,  lprActionReassign,  lprActionDownload,  lprActionLocalPrint,  lprActionServerPrint,  lprPageCount,  lprBytes,  lprDataType,  lprArchived,  lprJobName,  lprViewed,  lprRptID)

AS for the lprOwner IN (.....) There are another approches that you could take :

First one (since you have large set of values), I would consider a comparsion between lprOwner values that is in the table, and the current used values. To see how much actually left in the table since you have about 94 values.

Basically :

SELECT DISTINCT lprOwner FROM ReportIndex WHERE lprOwner NOT IN(current used values)

If the total values returned are less than 94 rows, then using the reverse method will be much better. Which would be using something like this :

AND lprOwner NOT IN(the returned values)

You can also use Temp or TVP methods to join them with the query so it would be

SELECT ....
FROM ReportIndex ri 
LEFT JOIN (The values table) t ON ri.lprOwner = t.lprOwner
WHERE 
    lprArchived = 0 
AND     lprPendingDelete = 0
AND     lprDone=0

Another method is to use sub-query like this :

SELECT *
FROM (
SELECT .... , lprOwner 
FROM ReportIndex 
WHERE 
    lprArchived = 0 
AND     lprPendingDelete = 0
AND     lprDone=0
) D 
WHERE 
lprOwner .....

Using the sub-query method, will make the results first filtered by lprArchived, lprPendingDelete, and lprDone.. Then, The results will be filtered by IprOwner. So, you're filtering in two sets rather than one set. This could be useful in huge tables sometimes.

This is what I've in mind, surely, not every method or suggestion could be applied in all situations, but at least the idea itself could lead you to your salvation

Slow query with many rows (simple query with no joins)

4 Answers4