
I have a table S with only 500 rows, and a table F with 120,000 rows. Both use GUID primary keys, and table F holds a foreign key to table S. Table F contains a varbinary(max) column F.Data with about 100 KB per row (the total database size is about 10 GB). FILESTREAM is turned on. I'm using SQL Server 2014 Express.

When I run the following UPDATE statement (in SQL Server Management Studio), which affects approximately 100,000 rows,

UPDATE F
SET F.Data = 0
FROM F
INNER JOIN S
ON S.SID = F.SID
WHERE S.BITFIELD = 1 AND S.Date < DATEADD(DAY,-90,GETDATE())

the query takes around 30 minutes. That's rather unacceptable, but I don't know enough about SQL to understand why, or how to make the query more efficient. Any gurus out there who can help?

FYI, the equivalent SELECT statement takes only a few seconds. I've searched around Stack Overflow and elsewhere, and haven't found anything particularly helpful (given my limited knowledge of SQL).

Michael Repucci
  • 100,000 rows at 5 MB/row turns into 500 GB of updated information. That's a lot. All of it needs to be logged. – Gordon Linoff May 25 '17 at 16:49
  • I think you've answered your own question, you've got a big table and you're updating lots of it.... – Matt May 25 '17 at 16:50
  • It gets logged and it also has to end up in the tables. Suggest breaking your update into smaller chunks. Eventually, you're just moving so much data that it's going to be slow. – Donnie May 25 '17 at 16:52
  • @Matt, my mistake. I updated the question. Each row is about 100 KB, the total database size being around 10 GB. – Michael Repucci May 25 '17 at 16:56
  • @GordonLinoff, my mistake. I updated the question. Each row is about 100 KB, the total database size being around 10 GB. – Michael Repucci May 25 '17 at 16:57
  • It sounds like you need to update 100k rows given your current design so if it were me I'd be questioning the data model and/or the design of the process that requires updating 100k rows. It looks like you're clearing old data so is that really necessary? Can it be done on a more frequent schedule with smaller updates? Etc. – Matt May 25 '17 at 17:13
  • @Matt, that's a good strategy, if necessary. I was just hoping that maybe I'd overlooked something else, like the INNER JOIN causing it to be slow or because the data is Filestream (not that I can change that, but I like to understand). It just didn't seem like so much data to me. – Michael Repucci May 25 '17 at 17:48

3 Answers


I see three things that can be addressed here:

  1. You haven't mentioned how many seconds the equivalent select statement takes to return, but if it is quite a few seconds (as in, not under 10), you may want to put the date into a variable instead of running the DATEADD function 100k times. The syntax for that would be:

    DECLARE @MyDate as DATETIME = DATEADD(DAY,-90,GETDATE());  
    UPDATE F
    SET F.Data = 0
    FROM F
    INNER JOIN S
    ON S.SID = F.SID
    WHERE S.BITFIELD = 1 AND S.Date < @MyDate
    
  2. You can opt to do the update in chunks of, say, 10k rows; this would hold fewer locks at a time and would likely finish faster (see the sketch after this list).

  3. Another thing I would check is the number of indexes on table F. When you're selecting, the optimizer just picks an index to read and you're done, whereas in an update, every index containing the affected column has to be updated as well.
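
To make point 2 concrete, here is a rough, untested sketch of a chunked version of the question's UPDATE. The 10,000-row batch size is an arbitrary starting point, and the sketch assumes a cleared row can be recognized by the four-byte zero value that SET Data = 0 stores:

DECLARE @MyDate datetime = DATEADD(DAY,-90,GETDATE());
DECLARE @BatchSize int = 10000;                           -- arbitrary; tune for your log size
DECLARE @Zero varbinary(max) = CAST(0 AS varbinary(max)); -- 0x00000000, what SET Data = 0 stores

WHILE 1 = 1
BEGIN
    -- Clear one batch per iteration so each transaction, and
    -- therefore each chunk of log activity, stays small.
    UPDATE TOP (@BatchSize) F
    SET F.Data = @Zero
    FROM F
    INNER JOIN S ON S.SID = F.SID
    WHERE S.BITFIELD = 1
      AND S.Date < @MyDate
      AND (F.Data IS NULL OR F.Data <> @Zero);  -- skip rows already cleared

    IF @@ROWCOUNT = 0 BREAK;
END

On a database using the SIMPLE recovery model (typically the default on Express), the log space can be reused between batches instead of growing to hold the entire update.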

Comment: The GUID as the PK is not helping performance here. If you have numerous non-clustered indexes, the GUID issue is exacerbated.
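
To illustrate that comment: NEWID-style GUIDs scatter inserts randomly across the clustered index, causing page splits and fragmentation, whereas sequential GUIDs always append at the end. Below is a minimal sketch of a sequentially-keyed variant of F; the table and constraint names are made up, and the question's FILESTREAM setup would need additional clauses not shown here:

-- Hypothetical replacement table: NEWSEQUENTIALID() generates ascending GUIDs,
-- so new rows go to the end of the clustered index instead of splitting pages.
CREATE TABLE F_New
(
    FID   uniqueidentifier NOT NULL
          CONSTRAINT DF_F_New_FID DEFAULT NEWSEQUENTIALID()
          CONSTRAINT PK_F_New PRIMARY KEY CLUSTERED,
    [SID] uniqueidentifier NOT NULL
          CONSTRAINT FK_F_New_S FOREIGN KEY REFERENCES S ([SID]),
    Data  varbinary(max) NULL
);

Fragmentation that has already accumulated can be addressed separately with ALTER INDEX ... REBUILD.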

Eli
  • 1. Adding a variable for the date made no difference. 2. This may be my only option - just wanted to make sure I wasn't overlooking something. 3. I don't understand indexes super well, but SSMS says the indexes are clustered indexes based on the PK, which are GUIDs. And, unsurprisingly, they're terribly fragmented (>85% on S, >99% on F). – Michael Repucci May 25 '17 at 19:40
  • @MichaelRepucci There's an awesome video from world-class DBA Brent Ozar explaining indexes. I feel that this will help you immensely. Here's the link: https://www.brentozar.com/archive/2016/10/think-like-engine-class-now-free-open-source/ – Eli May 26 '17 at 16:58

Have you tried creating a temp table with just one field (S.SID), holding all the records that match WHERE S.BITFIELD = 1 AND S.Date < DATEADD(DAY,-90,GETDATE()), and then joining to it in your UPDATE, rather than evaluating the condition in the WHERE clause during the update?

Also, an index on a GUID may not be as good as an index on an INT; read this: GUID vs INT IDENTITY. Good luck.

Something like this:

-- Stage the matching S keys once, with a clustered index to support the join.
CREATE TABLE [#TEMPTBL1]([SID] uniqueidentifier);
CREATE CLUSTERED INDEX IDX_TEMPTBL1_SID ON [#TEMPTBL1]([SID]);
INSERT INTO [#TEMPTBL1]([SID])
SELECT [SID] FROM S
WHERE S.BITFIELD = 1
  AND S.Date < DATEADD(DAY,-90,GETDATE());

UPDATE F
SET F.Data = 0
FROM F
INNER JOIN #TEMPTBL1 TMP ON F.SID = TMP.SID;

DROP TABLE #TEMPTBL1;

---------- code update with counter --------

DECLARE @updtCounter int = 0;

-- Stage the matching S keys, as above.
CREATE TABLE [#TEMPTBL1]([SID] uniqueidentifier);
CREATE CLUSTERED INDEX IDX_TEMPTBL1_SID ON [#TEMPTBL1]([SID]);
INSERT INTO [#TEMPTBL1]([SID])
SELECT [SID] FROM S
WHERE S.BITFIELD = 1
  AND S.Date < DATEADD(DAY,-90,GETDATE());

-- Count the qualifying F rows up front ...
SELECT @updtCounter = COUNT(*)
FROM F
INNER JOIN #TEMPTBL1 TMP ON F.SID = TMP.SID;

-- ... so TOP can stop the scan once that many rows have been updated.
UPDATE TOP (@updtCounter) F
SET F.Data = 0
FROM F
INNER JOIN #TEMPTBL1 TMP ON F.SID = TMP.SID;

DROP TABLE #TEMPTBL1;
Mario Levesque
  • This helped a little; shaved off 5 minutes. I've read the content you linked to, which leads me to believe that the slowness may be due to a clustering problem with the data in F. Because the GUID on that table is the clustered index (default SQL Server behavior not overridden), the UPDATE would have to jump all over the place to alter the data. – Michael Repucci May 25 '17 at 20:32
  • Hi again, here is another small thing you can try. You can see if you can use a counter to evaluate exactly how many rows you have to update before you start your update. Then when you run the Update you stop the table scan once you have reached that number knowing that your updates are actually done. See code update below my answer. – Mario Levesque May 30 '17 at 12:40

Some more suggestions:

  1. This could be a rare case where a cursor improves performance, by breaking the UPDATE into smaller chunks. You mentioned that table S has 500 rows and table F has 120K rows, so if they are roughly evenly distributed, there are about 240 rows in F for each row in S.

DECLARE @SID uniqueidentifier;

DECLARE c CURSOR FORWARD_ONLY FOR
    SELECT [SID] FROM S
    WHERE S.BITFIELD = 1
      AND S.Date < DATEADD(DAY,-90,GETDATE());

OPEN c;
FETCH NEXT FROM c INTO @SID;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- One small, separately logged UPDATE per parent row in S.
    UPDATE F
    SET F.Data = 0
    WHERE F.SID = @SID;

    FETCH NEXT FROM c INTO @SID;
END
CLOSE c;
DEALLOCATE c;
  2. You may also get better performance by using a BEGIN TRAN and COMMIT around the UPDATE.

  3. Depending on how often records in table S have their BitField set to 1, you could move the update into a trigger, provided that doesn't happen very frequently.

  4. Another approach is to select the data from F only when the BitField in S is not set:

SELECT CASE WHEN S.BitField = 1 THEN 0 ELSE F.Data END AS Data
FROM F
INNER JOIN S ON S.SID = F.SID

The select statement makes it appear that F.Data contains 0 whenever the BitField in S is set to 1. You could put the SELECT into a view (sketched below) and then use the view instead of the table when accessing F in other queries. Even though the F.Data field still holds the 100 KB value, any select from the view shows F.Data as either 0 or the actual value, depending on S.BitField. You would still need to run the UPDATE if you need to reclaim disk space, but you could schedule that for times when the system is not in use.
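
For point 3, here is a rough sketch of what such a trigger could look like. The trigger name is made up, and it assumes the clearing should happen as soon as a row in S is updated to meet the criteria:

-- Hypothetical trigger: whenever rows in S change, clear F.Data for any
-- changed row that now has BITFIELD = 1 and a Date older than 90 days.
CREATE TRIGGER trg_S_ClearOldData ON S
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE F
    SET F.Data = 0
    FROM F
    INNER JOIN inserted i ON i.[SID] = F.[SID]
    WHERE i.BITFIELD = 1
      AND i.Date < DATEADD(DAY,-90,GETDATE());
END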
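
And for point 4, a minimal sketch of the view. The view name is hypothetical; note the CAST, which keeps the CASE expression typed as varbinary(max) (with a plain 0, SQL Server's type precedence rules would convert the result to int):

CREATE VIEW vw_F
AS
SELECT F.[SID],  -- plus whatever other columns of F callers need
       CASE WHEN S.BitField = 1
            THEN CAST(0 AS varbinary(max))  -- present cleared data without touching the row
            ELSE F.Data
       END AS Data
FROM F
INNER JOIN S ON S.SID = F.SID;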

chrisuae
  • Thanks for the suggestions; I'm learning a lot. #1 actually made it slightly longer by 2 minutes. #2 shouldn't (and didn't) work as it's a single statement already. #3 is a good idea for future improvement of this process. I don't understand, however, what you're suggesting with #4. Can you clarify? – Michael Repucci May 26 '17 at 15:04
  • Please see clarification added to #4. – chrisuae May 26 '17 at 16:37
  • Ah, I see. Thanks for the clarification. That's a cool trick. But you were correct to guess that this is primarily to reduce space usage by clearing out old data. – Michael Repucci May 26 '17 at 19:27