
I need to update an entire table. I need a given row (ID 501 in this case) to be processed against all rows in the table (including itself).

This is what I have so far, but it takes 8 seconds (1.6 million rows):

DECLARE @ID INT = 501;
DECLARE @Lambda DECIMAL(10, 5) = 0.5;
set statistics time on

UPDATE RectangleOne
SET RectangleOne.Jaccard = ((1 - @Lambda) * RectangleOne.Jaccard) + @Lambda * dbo.fn_ComputeJaccard(
        RectangleTwo.MinX, RectangleTwo.MinY, RectangleTwo.MaxX, RectangleTwo.MaxY,
        RectangleOne.MinX, RectangleOne.MinY, RectangleOne.MaxX, RectangleOne.MaxY)
FROM PreProcessed RectangleOne
INNER JOIN PreProcessed RectangleTwo
ON RectangleTwo.ID = @ID

set statistics time off

I'm hoping someone can spot an obvious bottleneck. If not, what should I be looking at to improve performance?

I should add that performing the calculation takes only 290ms (tested with):

DECLARE @ID INT = 501;
DECLARE @Lambda DECIMAL(10, 5) = 0.5;
set statistics time on
    SELECT COUNT(1)
    FROM PreProcessed RectangleOne
    INNER JOIN PreProcessed RectangleTwo
    ON RectangleTwo.ID = @ID
    WHERE ((1 - @Lambda) * RectangleOne.Jaccard) + @Lambda * dbo.fn_ComputeJaccard(
            RectangleTwo.MinX, RectangleTwo.MinY, RectangleTwo.MaxX, RectangleTwo.MaxY,
            RectangleOne.MinX, RectangleOne.MinY, RectangleOne.MaxX, RectangleOne.MaxY) > -1
set statistics time off

And performing a basic update takes only 1500ms (tested using):

DECLARE @Lambda DECIMAL(10, 5) = 0.5;

UPDATE PreProcessed
SET PreProcessed.Jaccard = PreProcessed.Jaccard * @Lambda;

So it's not clear to me why my actual query takes as long as it does...

  • 6 seconds doesn't sound like a "performance bottleneck" for updating 1.6 million rows. – Gordon Linoff Jan 15 '14 at 17:12
  • Set the necessary RectangleTwo values to variables. Then you won't need the self join. – Dan Bracuk Jan 15 '14 at 17:17
  • @DanBracuk - No idea what that means, could you elaborate? – Faraday Jan 15 '14 at 17:18
  • @GordonLinoff - That's the bottleneck in the whole operation. If I do it in C# (memory) then I can get it down to .3s. I'm not expecting it to be that fast, but ideally sub-second... – Faraday Jan 15 '14 at 17:19
  • @Vijay . . . When you do the calculations in C#, they are not ACID compliant. You are simply updating values in memory, not maintaining logs, roll-back capabilities, and commits to disk, along with protections for other users accessing the data at the same time. There is no comparison. – Gordon Linoff Jan 15 '14 at 17:23
  • @GordonLinoff - I get that. That's why I said I don't expect it to be that fast. I, as always, believe that my code could be improved upon. If that's not the case, then well, great! But I'm pretty sure someone knows a little trick that I don't. Forget the comparison for the moment: is there anything you are aware of which might improve performance in my example code? – Faraday Jan 15 '14 at 17:25
  • The following link describes how to set variables from select queries: http://technet.microsoft.com/en-us/library/aa259186(v=sql.80).aspx – Dan Bracuk Jan 15 '14 at 17:30
  • @DanBracuk - I know how to do that. It's not clear to me how that approach would help. I'm having to process a single rectangle against all of the other 1.6 million; I doubt that repeatedly selecting into a variable would help, so I have to assume I'm missing the point... Could you elaborate? – Faraday Jan 15 '14 at 17:33
  • I've added more information; perhaps there is no way of making this faster, but I have to hope there is! – Faraday Jan 15 '14 at 18:24
  • @DanBracuk almost had it... (*four*, not *two*, and he had a typo) -- up front, use a `SELECT` to set four variables, to the four values you need from RectangleTwo: RectangleTwo.MinX, RectangleTwo.MinY, RectangleTwo.MaxX, and RectangleTwo.MaxY. Then pass those variables to the function... and you won't need the self join. – Doug_Ivison Jan 15 '14 at 19:24
  • ahhhh... Right, I'm sure that would result in a pretty good speed improvement. Thanks for the clarification! Feel free to post it as an answer along with any other more "general" tips and I'll mark it as the answer... Someone needs to get these points! :) – Faraday Jan 15 '14 at 22:47
  • I can't tell if this is what everyone above said, but you should do a `SELECT @var = [other code to calculate]` and then do that last `UPDATE [table] SET [table].[field] = @var`. Then you will have all your calculation done and the update will be fast. Your calculation still might take a bit. Also, 1.6 million records in SQL Server is getting toward big-ish, so it might take a little bit of time (depending on how the table is built). Your first example code will do the calculation once *for every row in the table* (1.6 million times) – tommy_o Jan 15 '14 at 23:27
  • Since Dan was onsite 9 hours ago, and the last of these comments was 12 hours ago... I guess he's not posting. (If he does post, I say upvote his too... biggest idea was his.) As you'll see below, I moved the calc of `1 - @Lambda` to a variable, too. (And a trivial P.S.: I wonder if `1.0 - @Lambda` might be a tiny bit faster -- which would only matter when it's being repeated -- so the `1` doesn't have to be CAST from `int` to `decimal`.) – Doug_Ivison Jan 16 '14 at 12:10

1 Answer


So, as discussed in the comments above, one way to save a chunk of time is to use variables to make the INNER JOIN unnecessary.

Also... probably minor... I created a new variable @Lambda_Flip, so the calculation of 1 - @Lambda is only being done once... and then @Lambda_Flip gets used, at that spot in the UPDATE.

Lastly, a cosmetic change: I put an additional pair of parentheses around (@Lambda * dbo.fn_ComputeJaccard(@R2MinX, @R2MinY, @R2MaxX, @R2MaxY,RectangleOne.MinX, RectangleOne.MinY, RectangleOne.MaxX, RectangleOne.MaxY)). Even though you and I know that it's already being executed before the addition, due to SQL's order-of-operations... I thought those parentheses might make it easier for the reader to follow, reading left-to-right.

DECLARE @ID INT = 501;
DECLARE @Lambda DECIMAL(10, 5) = 0.5;
DECLARE @Lambda_Flip DECIMAL(10, 5) = 1 - @Lambda;
DECLARE @R2MinX INT, @R2MinY INT, @R2MaxX INT, @R2MaxY INT;
set statistics time on

SELECT @R2MinX = MinX, @R2MinY = MinY, @R2MaxX = MaxX, @R2MaxY = MaxY
FROM PreProcessed 
WHERE ID = @ID;

UPDATE RectangleOne
SET RectangleOne.Jaccard = (@Lambda_Flip * RectangleOne.Jaccard) 
+ (@Lambda * dbo.fn_ComputeJaccard(
        @R2MinX, @R2MinY, @R2MaxX, @R2MaxY,
        RectangleOne.MinX, RectangleOne.MinY, RectangleOne.MaxX, RectangleOne.MaxY))
FROM PreProcessed RectangleOne

set statistics time off

If you use this or a variation of it, then (as always) please let us and posterity know what worked, and how the performance changed :) ...thanks!

Doug_Ivison
  • I did use something similar to this; I ended up putting the `(@Lambda_Flip * RectangleOne.Jaccard)` part into my CLR function too. All in, I saved .27 of a second, not too shabby! :) – Faraday Jan 16 '14 at 12:26
  • @Vijay P.S. It would be interesting to see if there is any effect from changing both `@Lambda` and `@Lambda_Flip` to `float`. Note, from **[this SO link](http://stackoverflow.com/questions/1056323/difference-between-numeric-float-and-decimal-in-sql-server/7158770#7158770)**: "Approximate Numeric Data Type usually uses less storage and have better speed (up to 20x)..." Heck, could try `float` for the `@R2MinX` etc. variables, too. – Doug_Ivison Jan 17 '14 at 18:48
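
For anyone who wants to try that last suggestion, here is an untested sketch of the `float` variation: it is the same logic as the answer's code, with the `DECIMAL` variables swapped for `FLOAT`. It assumes `fn_ComputeJaccard` will accept (or implicitly convert) `FLOAT` arguments, and whether it actually helps would still need measuring with `SET STATISTICS TIME`:

DECLARE @ID INT = 501;
DECLARE @Lambda FLOAT = 0.5;
DECLARE @Lambda_Flip FLOAT = 1.0 - @Lambda;
DECLARE @R2MinX FLOAT, @R2MinY FLOAT, @R2MaxX FLOAT, @R2MaxY FLOAT;
set statistics time on

-- Grab the four RectangleTwo values once, exactly as in the answer above
SELECT @R2MinX = MinX, @R2MinY = MinY, @R2MaxX = MaxX, @R2MaxY = MaxY
FROM PreProcessed
WHERE ID = @ID;

-- Same single-pass update, now using FLOAT arithmetic throughout
UPDATE RectangleOne
SET RectangleOne.Jaccard = (@Lambda_Flip * RectangleOne.Jaccard)
    + (@Lambda * dbo.fn_ComputeJaccard(
        @R2MinX, @R2MinY, @R2MaxX, @R2MaxY,
        RectangleOne.MinX, RectangleOne.MinY, RectangleOne.MaxX, RectangleOne.MaxY))
FROM PreProcessed RectangleOne

set statistics time off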