-1

I have a query:

UPDATE TOP(100000) pv
SET    pv.intUrlId = urls.intUrlId
FROM   [schema1].[Urls] urls WITH(NOLOCK)
       INNER JOIN [schema2].[PageViews] pv WITH(NOLOCK)
         ON pv.urlId = urls.id
            AND pv.intUrlId IS NULL 

Every column used in this query is an index and urls.id is a PRIMARY KEY. But query is still too slow. Trying to improve performance I've rebuilded pv_urlId_IDX index adding pv.intUrlId column as 'included column'. My reasoning was as follows: Query needs to search pv_urlId_IDX index to perform JOIN and next pv_intUrlId_IDX to determine NULL records. If I add value of pv.intUrlId to pv_urlId_IDX index a testing of second condition would be performed 'in place' and searching second index wouldn't be performed. Unfortunatelly I haven't noticed any performance boost.

I also read SQL Server documentation and they every time mention of nonkey column in scope of included columns. So my question is: does including key column in index make any sense and if it does when we can benefit of such solution.

Martin Smith
  • 438,706
  • 87
  • 741
  • 845
MKB
  • 281
  • 2
  • 14
  • 5
    Why are you using `WITH(NOLOCK)` on the table that is the target of the `UPDATE`? How slow is "too slow"? Any triggers or foreign keys on the table? What does the execution plan look like? – Martin Smith Sep 10 '13 at 09:56
  • `WITH(NOLOCK)` - I've got an advice from previous database manager to use this where possible because it makes query faster. I think 'slow' is less than 1000 records per socond. Query vary - it started with 4000 records per second. And after 90mln updated rows it was only 300 records per second. I've tried many solutions. Only filtered index `WHERE IS NULL` as pv_intUrlId_IDX helped (700 records). But after next 40mln updates it slows down to 300 records. No triggers and foreign keys. Table 200 mln records. I've asked overall question independent of tuning my query. Please answer it as well! – MKB Sep 10 '13 at 10:22
  • I've probably found problem with my query. It uses nonclustered index on `urls.id` with included `urls.intUrlid` column instead of Urls table PRIMARY KEY to perform JOIN. I need to remove that index. But my question is still valid and it is important to me to know if key columns as included columns make any sense – MKB Sep 10 '13 at 10:30
  • That index - `urls (id) INCLUDE (intUrlId)` - looks like the best for this UPDATE statement. – ypercubeᵀᴹ Sep 10 '13 at 10:39
  • I think it isn't a good idea @ypercube . I've got such idea before but from what I know now clustered index include full row and urls.id PRIMARY KEY is clustered index so I think next nonclustered index on id column could spoil performance not improve it – MKB Sep 10 '13 at 10:46
  • It's not a bad idea. It's not like a super-boost but it is an index that has all the columns needed for the Update and is narrower than the clustered index on `urls` (assuming that the table has more than these 2 columns.) No wonder that the optimizer chooses it over the CI. – ypercubeᵀᴹ Sep 10 '13 at 10:48

3 Answers3

1

Few items to keep in mind:

  • Did you look into the execution plan?
  • I see that you are updating 100000 rows, please note that many times the optimizer changes from Index Seek to Table Scan when it sees that you are picking more that an x% of the rows in the table.
  • Once you have a column included in the index, your UPDATE will be slower since you are updating it (from NULL to urls.intUrlId)
  • as @MartinSmith wrote, your WITH(NOLOCK) is irrelevant, did you check if you have been locked?
Luis LL
  • 2,912
  • 2
  • 19
  • 21
  • `WITH(NOLOCK)` is irrelevant or it can spoil performance? Because if it can't then using it can only help. This is production server and I can't control what queries except mine are performed on database. – MKB Sep 10 '13 at 10:36
  • Such hint doesn't make sense in an update, I never tried, and I don't know if it will harm or not. No offence, but I'm not going to test if it spoils or is ignored. – Luis LL Sep 10 '13 at 10:42
  • [Is the NOLOCK (Sql Server hint) bad practice?](http://stackoverflow.com/questions/1452996/is-the-nolock-sql-server-hint-bad-practice) – ypercubeᵀᴹ Sep 10 '13 at 10:47
  • @ypercube the general rule is, yes it's very bad practice, like saying **I don't care the quality of the results** I'm not saying that there aren't such cases that you'll be using it, but in general avoid it. Improve your queries to make them run faster. At least that's what my experience taught me. – Luis LL Sep 10 '13 at 10:52
  • Query plan doesn't provide me anything useful. 86% cost is nonclustered index seek `urls (id) INCLUDE (intUrlId)` which is performed for JOIN. 3% is PRIMARY KEY scan on PageViews. Rest is update indexes cost. – MKB Sep 10 '13 at 11:02
  • are the estimated and actual rows similar? – Luis LL Sep 10 '13 at 11:06
  • 1
    Add the execution plan in the question. The numbers (88%, 3%) are not so much important as the structure of the plan, the indexes used, the kind of operations (hash or merge or nested loop joins) and the order of operations. – ypercubeᵀᴹ Sep 10 '13 at 11:13
  • I performed execution plan on `TOP(1000)` rows update. It depends - for nonclustered index seek `urls (id) INCLUDE (intUrlId)` actual rows is 1000 and estimated rows is 1. For PageViews PRIMARY KEY scan actual and estimated rows are nearly the same. BTW. what these values mean? What is the difference between them? – MKB Sep 10 '13 at 11:23
  • I'm not an expert, but I see that when you have such big differences the optimizer can't do it well, it can take much longer, since its statistics aren't worth much. you should google for more details. – Luis LL Sep 10 '13 at 11:29
1

Without the actual execution plan and the CREATE TABLE statements, here are some thoughts:

  • the Urls (id) INCLUDE (intUrlId) does not need to be removed. It's probably the best index (on Urls) for this UPDATE statement. Whether it gives big or small improvement over the clustered index depends on how wide is the table but it's certainly not worse and the optimizer chooses it for that.

  • remove the WITH (NOLOCK). There are numerous links on the web and many on StackOverflow itself that explain why it's bad practice in general.

  • consider (and test) adding an index on PageViews (intUrlId, urlId) or a partial index on PageViews (urlId) WHERE intUrlId IS NULL. Both will improve efficiency of finding the rows that need to be updated. They are kind of two-face options though as the time needed for index updates of the statement will also increase (one more index to be updated. )

  • and last - but not least - thing to try is to split the UPDATE statement in smaller chunks. You can run for example 100 statement with TOP (1000) instead of the big 100K update you are running and test efficiency.

And to answer your question, no, an index on PageViews (urlId) INCLUDES (intUrlId) is not going to be very useful. Because it adds one more index to the table (that the UPDATE has to update as well) and because it's less selective than any of the above two suggested indexes.

ypercubeᵀᴹ
  • 113,259
  • 19
  • 174
  • 235
  • I've got the same thoughts :-) (except NOLOCK) As I said before - every column used in query has an index. I have partial index (as far my best idea which improved performance 300%) but not on urlId as you propose but on intUrlId: `PageViews (intUrlId) WHERE intUrlId IS NULL`.But for some reasons it slow down after next 40mln rows to the same performance as before changing index to partial index. I thought it will be faster and faster because amount of NULL values shrinks so index should shrink resulting in performance boost. I've also tried different packs sizes and 100000 seems to be optimal – MKB Sep 10 '13 at 12:25
  • The `PageViews (intUrlId) WHERE intUrlId IS NULL` is not very useful, is it? (Think about it, it's an index that has only NULLs !) Have you tried adding one of the two indexes I suggested and removing the NOLOCKs? I don't think there is anything else others can do - until you provide the execution plan. – ypercubeᵀᴹ Sep 10 '13 at 12:29
  • Also: are there any (update) triggers on the table? And if the performance decreases when you do large number of deletes, I'd suspect that the transaction log needs to grow. – ypercubeᵀᴹ Sep 10 '13 at 12:30
  • Query plan is nearly plain: UPDATE (0%) <-- Index update, PageViews_intUlrId (0%) <-- Filter( 0%) <-- Split (0%) <-- Clustered Index Update, PK_PageViews (10%) <-- Compute Scalar 3x (0%) <-- Top (0%) <-- Nested Loops (Inner Join) (0%) (and here starts only parallel operation) <-- Clustered Index Scan, PK_PageViews (3%) and Index Seek (NonClustered) (86%) – MKB Sep 10 '13 at 12:34
  • I haven't tested your proposal of partial index on urlId because index is building for an hour (but I will). But I disagree with you about partial index on intUrlId - it boosted performance 300%. But after that boost queries gradually loose their speed. – MKB Sep 10 '13 at 12:45
  • How many rows does the table have and for how many of them `intUrlId IS NULL` ? – ypercubeᵀᴹ Sep 10 '13 at 12:48
  • 200 mln rows. Now is about 60 mln rows meet condition intUrlId IS NULL – MKB Sep 10 '13 at 13:00
  • 1
    Why didn't you include this info in the first place? That you basically want to update almost all rows of a 200M rows table. Really, the `CREATE TABLE`, the indexes, the tables' sizes and the execution plans of queries you tried, the more details of what exactly you are trying to do would help others help you better (and faster) – ypercubeᵀᴹ Sep 10 '13 at 13:19
  • You are generally right @ypercube - more details = better answers. But I have some bad experience - people don't read every information you provide, especially if your post is very long. And long posts sometimes resulting in not reading post at all or reading only part of information and giving answers which are useless. Second thing - I can't reveal database structure on public. Every time I provide some info it needs to be distort - corporation security rules :-/ – MKB Sep 10 '13 at 13:59
  • I provided table size in some earlier comment. Also everybody ignored my overall question. Everybody focused on solving problem with query but I didn't asked **how to boost my query** I asked **does including key column in index make any sense and if it does when we can benefit of such solution** For me as a BIGDATA beginner knowledge is many times more important than solution of exact problem. – MKB Sep 10 '13 at 14:04
  • I edited my answer but I thought it was clear anyway ("no, the INCLUDE index you suggested in not useful. The problem of the query lies somewhere else") - and that's why people focus elsewhere. – ypercubeᵀᴹ Sep 10 '13 at 14:09
  • If the actual question is "does including key column in index make any sense" then why is is the last sentence and does not even have a ? mark? Withholding information does not improve answers. Clarity, formatting, and clearly stating the question early improves answers. – paparazzo Sep 10 '13 at 14:16
  • What is the difference where is the question if it is only question in post? I clarify problem first and next ask the question. Also construction of english language ensure the question is the question. Question mark is a nuance.And finally - I haven't said tuning my query is not important at all. I just said that my question from the end of my post is at least same important or more. But it seems two problems in one post its enough for you @Blam to state that question is unclear. If next time for overall question I wouldn't provide any code and leave it alone it'll be enough clear for you? – MKB Sep 10 '13 at 14:57
  • I tested your last hint @ypercube about using filtered index on `urlId` column. But unfortunatelly it spoiled query performance. It seems rebuilded PageViews PRIMARY KEY is now faster than any nonclustered key with included useful column. A bit accidentially but finally I gain singnificant performance boost as a benefit of this topic. Now I have 800 rows/s and I assume it is max I can get – MKB Sep 11 '13 at 15:48
0

Index on every column is the problem
Disable the index on

[schema2].[PageViews].[intUrlId]

Perform the update Then rebuild that index

That index is used by the update for a check for null.
I would rather deal with a column scan than maintaining that index.
If the update slows down after a number of updates then most likely the problem is that index is getting fragmented.

Try bigger top and maybe even no top.

Drop the (nolock) and let the optimizer decide

disable index on [schema2].[PageViews].[intUrlId]

UPDATE top (100000) pv
   SET pv.intUrlId = urls.intUrlId  
  FROM [schema1].[Urls] urls WITH (NOLOCK)
 INNER JOIN [schema2].[PageViews] pv
    ON pv.urlId = urls.id
   AND pv.intUrlId IS NULL

rebuild index
paparazzo
  • 44,497
  • 23
  • 105
  • 176
  • This may be true (that index gets fragmented) and part of the real problem but I see another issue. If they want to update many rows (like the 60M they mentioned), in chunks of 100K statements, the lack of any index on `intUrlId` means that they have to perform 600 table scans, not one. – ypercubeᵀᴹ Sep 10 '13 at 15:53
  • @ypercube So 600 times. If the scan is faster than the index maintenance then it is faster 600 times. As the index fragments both inserts and index scans slow down. I fight this a lot and I have found that for mass inserts or updates index maintenance takes longer than a full scan every time unless the data is loaded in the order of the index. – paparazzo Sep 10 '13 at 16:04
  • I'm not saying that your answer is wrong. It may be the best (but in my opinion, it would make more sense if the UPDATE could be done in one statement.) – ypercubeᵀᴹ Sep 10 '13 at 16:38
  • @ypercube I just left the top in. I would try without the top. – paparazzo Sep 10 '13 at 18:21
  • Your approach didn't improve performance but didn't spoil it as well. It proves one thing - index on `intUrlId` which was created intentially for this query was useless. Also testing your approach accidentally gave me real gain. I disabled more than one index and then used `ALTER INDEX ALL REBUILD`. Before I asked question I rebuilded all indexes combined with columns used in query because of huge fragmentation (and noticed no significant change). But I haven't rebuilt PageViews PRIMARY KEY - the only index really matters. Now it is 800 rows/s and probably it's max. Thanks for any suggestions – MKB Sep 11 '13 at 15:30