SQL Query on single table-valued parameter slow on large input

Question

I have a table with this simple definition:

CREATE TABLE Related 
(
    RelatedUser NVARCHAR(100) NOT NULL FOREIGN KEY REFERENCES User(Id),
    RelatedStory BIGINT NOT NULL FOREIGN KEY REFERENCES Story(Id),
    CreationTime DateTime NOT NULL,

    PRIMARY KEY(RelatedUser, RelatedStory)
);

with these indexes:

CREATE INDEX i_relateduserid 
    ON Related (RelatedUserId) INCLUDE (RelatedStory, CreationTime)

CREATE INDEX i_relatedstory 
    ON Related(RelatedStory) INCLUDE (RelatedUser, CreationTime)

And I need to query the table for all stories related to a list of UserIds, ordered by Creation Time, and then fetch only X and skip Y.

I have this stored procedure:

CREATE PROCEDURE GetStories
    @offset INT,
    @limit INT,
    @input UserIdInput READONLY
AS
BEGIN
    SELECT RelatedStory 
    FROM Related
    WHERE EXISTS (SELECT 1 FROM @input WHERE UID = RelatedUser)
    GROUP BY RelatedStory, CreationTime
    ORDER BY CreationTime DESC
    OFFSET @offset ROWS FETCH NEXT @limit ROWS ONLY;
END;

Using this User-Defined Table Type:

CREATE TYPE UserIdInput AS TABLE 
(
    UID nvarchar(100) PRIMARY KEY CLUSTERED
)

The table has 13 million rows, and gets me good results when using few userids as input, but very bad (30+ seconds) results when providing hundreds or a couple thousand userids as input. The main problem seems to be that it uses 63% of the effort on sorting.

What index am I missing? this seems to be a pretty straight forward query on a single table.

Have you considered changing your WHERE EXISTS to a JOIN. Joins tend to perform better, especially with larger sets. — John Cappelletti, Oct 03 '16 at 19:13
As far as I've been able to google, EXISTS is preferred if you're checking for existence, like the answer here suggests: http://stackoverflow.com/questions/7082449/exists-vs-join-and-use-of-exists-clause Would this not be the case? — bech, Oct 03 '16 at 19:20
Yes, but as you mentioned, performance suffers as your set expands.. If the JOIN key is not indexed, you may get better results with EXISTS. My thinking is this..."Like chicken soup for a cold... couldn't hurt to try." — John Cappelletti, Oct 03 '16 at 19:30
Update: Rewriting the query to use JOIN instead of Exists seems to amount to the same execution plan: http://i.imgur.com/iX86tWO.jpg With UID being the primary key, shouldn't it be able to do index seek for the temp table? — bech, Oct 03 '16 at 19:30
bech: just FYI, you don't have a temp table. You have a table _variable_ that is acting as a Table-Valued Parameter. — Solomon Rutzky, Oct 03 '16 at 19:34
@srutzky - Thank you. Is that just a nomenclature thing or can it impact performance when I really want a temp table? — bech, Oct 03 '16 at 19:35
@bech A TVP can only be a Table Variable. When not needing a table as an input parameter, then yes, there are differences between temporary tables and table variables. — Solomon Rutzky, Oct 03 '16 at 19:43
@bech For the `i_relatedstory` index, why are you including the key field, `RelatedStory`, and why is `CreationTime` an include column and not the 2nd key field? I would think that `(RelatedStory, CreationTime)` and no include columns would be better. This is similar to Paparazzi's 2nd suggestion. Remember, "Include" columns are not used for sorting or comparisons, only for covering. Still, using INT for PK / FK will still be optimal. — Solomon Rutzky, Oct 03 '16 at 20:30
Sorry, RelatedStory on i_relatedstory was a mistype. It was meant to say RelatedUser. I tried creating a different index on CreationTime that included RelatedStory, but I am yet to try a index on both CreationTime and RelatedStory. After I do the normalization, I'll try that too. — bech, Oct 03 '16 at 20:36
`GROUP BY RelatedStory, CreationTime` does not make sense to me, do You expect that multiple users will be related to the same RelatedStory with the same CreationTime and You have to handle this particular case (different CreationTime is not deduplicated)? There is also typo in index - RelatedLassoId. — Antonín Lejsek, Oct 03 '16 at 21:13
The group by is a way of eliminating duplicate RelatedStories. Several users can be related to a single story, and we want want the query to return @limit unique items for every page. — bech, Oct 04 '16 at 07:54
But they would not be unique if CreationTime differ for different users for Story. You would return duplicate Stories. — Antonín Lejsek, Oct 04 '16 at 09:01
Fair point, but what is probably not clear from the context, CreationTime is just straight up denormalization (through duplication) from the Story it relates to, so for the same story, they will always be the same. — bech, Oct 04 '16 at 09:37

Solomon Rutzky · Answer 1 · 2016-10-03T21:09:40.870

What types of values do you have for RelatedUser / UID ? Why, exactly, are you using NVARCHAR(100) for it? NVARCHAR is usually a horrible choice for a PK / FK field. Even if the value is a simple, alphanumeric code (e.g. ABTY1245) there are better ways of handling this. One of the main problems with NVARCHAR (and even with VARCHAR for this particular issue) is that, unless you are using a binary collation (e.g. Latin1_General_100_BIN2), every sort and comparison operation will apply the full range of linguistic rules, which can be well worth it when working with strings, but unnecessarily expensive when working with codes, especially when using the typically default case-insensitive collations.

Some "better" (but not ideal) solutions would be:

If you really do need Unicode characters, at least specify a binary collation, such as Latin1_General_100_BIN2.
If you do not need Unicode characters, then switch to using VARCHAR which will take up half the space and sort / compare faster. Also, still use a binary Collation.

Your best bet is to:

Add an INT IDENTITY column to the User table, named UseID
Make UserID the Clustered PK
Add an INT (no IDENTITY) column to the Related table, named UserID
Add an FK from Related back to User on UserID
Remove the RelatedUser column from the Related table.
Add a non-clustered, Unique Index to the User table on the UserCode column (this makes it an "alternate key")
Drop and recreate the UserIdInput User-Defined Table Type to have an INT datatype instead of NVARCHAR(100)
If at all possible, alter the ID column of the User table to have a binary collation (i.e. Latin1_General_100_BIN2)
If possible, rename the current Id column in the User table to be UserCode or something like that.
If users are entering in the "Code" values (meaning: cannot guarantee they will always use all upper-case or all lower-case), then best to add an AFTER INSERT, UPDATE Trigger on the User table to ensure that the values are always all upper-case (or all lower-case). This will also mean that you need to make sure that all incoming queries using the same all upper-case or all lower-case values when searching on the "Code". But that little bit of extra work will pay off.

The entire system will thank you, and show you its appreciation by being more efficient :-).

One other thing to consider: the TVP is a table-variable, and by default those only ever appear to the query optimizer to have a single row. So it makes some sense that adding a few thousand entries into the TVP would slow it down. One trick to help speed up TVP in this scenario is to add OPTION (RECOMPILE) to the query. Recompiling queries with table variables will cause the query optimizer to see the true row count. If that doesn't help any, the other trick is to dump the TVP table variable into a local temporary table (i.e. #TempUserIDs) as those do maintain statistics and optimize better when you have more than a small number of rows in them.

From O.P.'s comment on this answer:

[UID] is an ID used across our system (XXX-Y-ZZZZZZZZZZ...), XXX being letters, Y being a number and Z being numbers

Yes, I figured it was an ID or code of some sort, so that doesn't change my advice. NVARCHAR, especially if using a non-binary, case-insensitive collation, is probably one of the worst choices of datatype for this value. This ID should be in a column named UserCode in the User table with a non-clustered index defined on it. This makes it an "alternate" key and a quick and easy lookup from the app layer, one time, to get the "internal" integer value for that row, the INT IDENTITY column as the actual UserID (is usually best to name ID columns as {table_name}ID for consistency / easier maintenance over time). The UserID INT value is what goes into all related tables to be the FK. An INT column will JOIN much faster than an NVARCHAR. Even using a binary collation, this NVARCHAR column, while being faster than its current implementation, will still be at least 32 bytes (based on the given example of XXX-Y-ZZZZZZZZZZ) whereas the INT will be just 4 bytes. And yes, those extra 28 bytes do make a difference, especially when you have 13 million rows. Remember, this isn't just disk space that these values take up, it is also memory since ALL data that is read for queries goes through the Buffer Pool (i.e. physical memory!).

In this scenario, however, we're not following the foreign keys anywhere, but directly querying on them. If they're indexed, should it matter?

Yes, it still does matter since you are essentially doing the same operation as a JOIN: you are taking each value in the main table and comparing it to the values in the table variable / TVP. This is still a non-binary, case-insensitive (I assume) comparison that is very slow compared to a binary comparison. Each letter needs to be evaluated against not just upper and lower case, but against all other Unicode Code Points that could equate to each letter (and there are more than you think that will match A - Z!). The index will make it faster than not having an index, but nowhere near as fast as comparing one simple value that has no other representation.

While I see your point about NVARCHAR as foreign keys, this is an ID used across our system (XXX-Y-ZZZZZZZZZZ...), XXX being letters, Y being a number and Z being numbers. In this scenario, however, we're not following the foreign keys anywhere, but directly querying on them. If they're indexed, should it matter? — bech, Oct 03 '16 at 19:34
@srutzky Thanks for the input (and updated explanation). I'm currently trying this on a copy of the database to see the difference. From a gut feeling, do you think we can obtain the same results for 1-2 thousands of input rows as we can for say, 100, with this change? (I know, bit of stretch to guess about these things but still..) — bech, Oct 03 '16 at 19:54
@bech I updated again. Please review :). I can't say how much faster it will be, especially since I recommended several options and I don't know which one you are trying. But I can guarantee that going with the "ideal" approach of using an INT for the PK / FK columns and using `UID` only once, in the `User` table, will make all queries using this code and/or joining these tables. — Solomon Rutzky, Oct 03 '16 at 20:07
@srutzky I'm trying the ideal approach of normalizing the database for the UserId table.. With 13 million rows it takes a while to set all the IDs ;) I'm sort of worried that the increased effort to first find the rows in the UserId table and then joining on the Relevance table will negate the bonus of searching on integers instead of strings, but I really wouldn't know. I appreciate the effort you've put into this, and I'll definitely mark it as the answer should it work. — bech, Oct 03 '16 at 20:23
@bech please see my updates. I clarified a few things and added an additional suggestion. — Solomon Rutzky, Oct 03 '16 at 21:10
@srutzky: I've now tried (on a different database) normalizing the userid by adding an integer ID, using that and adding an index to it on Related. It does not seem to have improved the queryplan, as it now uses 15% on scanning the TVP, 15% on seeking the UserId table, 18% on seeking the Related table for the new ID, and then finally 52% on sorting (It used to be 18% on seeking the TVP, 18% on seeking the Related table and then the remainder was sorting) I tried adding the index (RelatedStory, CreationTime) but I don't think it's being used :( — bech, Oct 04 '16 at 07:32
I've managed to find an index that gets the query time down to an acceptable amount. I've posted it as a post to this thread. — bech, Oct 04 '16 at 09:00

score 2 · Accepted Answer · answered Oct 04 '16 at 08:57

So I finally found a solution.

While @srutzky had good suggestions of normalizing the tables by changing the NVARCHAR UserId to an Integer to minimize comparison cost, this was not what solved my problem. I will definitely do this at some point for the added theoretical performance, but I saw very little change in performance after implementing it right off the bat.

@Paparazzi suggested I added an index for (RelatedStory, CreationTime), and that did not do what I needed either. The reason was, that I also needed to also index RelatedUser as that's the way the query goes, and it groups and orders by both CreationTime and RelatedStory, so all three are needed. So:

CREATE INDEX i_idandtime ON Related (RelatedUser, CreationTime DESC, RelatedStory)

solved my problem, bringing my unacceptable query times of 15+ seconds down to mostly 1-second or a couple of seconds querytimes.

I think what gave me the revelation was @srutzky noting:

Remember, "Include" columns are not used for sorting or comparisons, only for covering.

which made me realize I needed all my groupby and orderby columns in the index.

So while I can't mark either of the above posters post as the Answer, I'd like to sincerely thank them for their time.

paparazzo · Answer 3 · 2016-10-03T20:08:28.137

1

The main problem seems to be that it uses 63% of the effort on sorting.

ORDER BY CreationTime DESC

I would suggest and index on CreationTime

Or try an index on RelatedStory, CreationTime

edited Oct 03 '16 at 20:08

answered Oct 03 '16 at 20:02

paparazzo

44,497
23
105
176

Hi. Yep, I tried that, but it didn't seem to want to use it, even after I added a where clause using the indexed field, and recompiled the execution plan. Any thoughts on why? – bech Oct 03 '16 at 20:24
You tried both of those? And I would go with a join over the exists. – paparazzo Oct 03 '16 at 20:25
I've actually only tried the index only on the CreationTime field. Will try on both RelatedStory and CreationTime. – bech Oct 03 '16 at 20:27

SQL Query on single table-valued parameter slow on large input

3 Answers3