2

So I'm working on some database 'de-identification' where essentially each piece of information changes. On most of the smaller tables a simple while update wasn't too time consuming (up through tables with 10,000 rows or so. I've now moved to one that has approximately 500,000 rows.

I've read that the fastest way to do this sort of "update" is to in fact just select into a temporary table updating the column you need. (I read it here. Fastest way to update 120 Million records)

The problem with this is that the OP is updating all similar values with a single value where as each of my values are different, ie he updates the null rows in a single column to -1, I'm updating each column in my new row to more or less a random date; This is what I've got so far.

--The only Index on Treatments is a Clustered Primary Key (TreatmentID)
SELECT * INTO #Treatments_temp
FROM Treatments
CREATE CLUSTERED INDEX IDX_Treatments ON #Treatments_temp(TreatmentID)

SET @rows = (SELECT TOP 1 TreatmentID
             FROM Treatments
             ORDER BY TreatmentID Desc)

WHILE (@rows > 0)
  BEGIN

    --There are only 500,000 records in this table from count(*) but the PK is much 
    --higher (some records are deleted, made in error ETC so this if statement is my
    --attempt to bypass the code for @rows that don't actually exist.

    IF (SELECT TreatmentID FROM #Treatments_temp WHERE TreatmentID = @rows) IS NOT NULL
      BEGIN
      DECLARE @year INT;
      DECLARE @month INT;
      DECLARE @date INT;
      DECLARE @newStartDate SMALLDATETIME;
      DECLARE @multiplier FLOAT;

      SET @multiplier = (SELECT RAND());

      SET @year = @multiplier * 99 + 1900;
      SET @month = @multiplier * 11 + 1;
      SET @date = @multiplier * 27 + 1;

      SET @newStartDate = DATEADD(MONTH,((@year-1900)*12)+@month-1,@date-1);

      UPDATE #Treatments_temp
      SET StartDate = @newStartDate
      WHERE TreatmentID = @rows

      UPDATE #Treatments_temp
      SET EndDate = DATEADD(MINUTE, @timebetween, @newStartDate)
      WHERE TreatmentID = @rows
      END

  SET @rows = @rows - 1
  END
Community
  • 1
  • 1
Kulingar
  • 941
  • 3
  • 14
  • 30
  • Can you share your table structure too please. Also what indexes are on your `Treatments` table? – JNK Nov 04 '11 at 12:15
  • It would be helpful to see an example of ther current columns involved and what they would be updated to be. That would help us identify if you can do this using set-based operations, rather than a loop – Josh Darnell Nov 04 '11 at 12:17
  • @JNK Treatments has 10 columns with no obnoxious data types, int's, datetimes, bools. The only Index is a clustered Primary Key (TreatmentID) Thanks! – Kulingar Nov 04 '11 at 12:28
  • 1
    @jadarnel27 I've included what I'm updating. The dates being updated are just as random as what my code creates just need the dates to change though the time between stays the same. Thanks! – Kulingar Nov 04 '11 at 12:29
  • "a simple while update wasn't too time consuming" - This may be a stupid question, but did you try a regular `UPDATE` query? Without a loop? I have a table of about the same size that I've had to UPDATE a whole column, and it wasn't all that slow. – Josh Darnell Nov 04 '11 at 13:45

3 Answers3

2

Without knowing more about what you have, I think it would be simplest to:

  • Put your "randomizing" logic into a scalar function
  • Make a narrow table with just your ID and the result of the function for each ID
  • Update your Treatment table with an INNER JOIN on the narrow table to get the new value

There's no need for a row-by-row approach to this.

JNK
  • 63,321
  • 15
  • 122
  • 138
  • I can put the random date generation into a scalar function assuming it will help. + You mean instead of SELECTing everything into the temporary Table I should only do the PK and the 2 other fields I'm working with (StartDate, EndDate)? Even if so, won't it take just as much time to go back and update with the other values after that? + Mentioned in part 2. – Kulingar Nov 04 '11 at 12:38
  • @Kulingar - Yes, don't put everything in the temp table, just what you are changing. And no, it won't take the same amount of time. You are moving half the amount of data. – JNK Nov 04 '11 at 12:50
  • Actually, UDF's are supposed to be deterministic, it's going to be interesting to see how you will make them return random data. – deroby Dec 09 '11 at 16:12
1

I think this should work :

-- using NewID() instead of Rand() because Rand() is only interpreted once for the entre query while NewID() is for each record
-- Based on your logic I understand newStartDate had to be between 1 jan 1801 and 28 dec 1999 
DECLARE @multiplier float
DECLARE @max_int    float
DECLARE @daterange  float

SELECT @max_int   = Power(Convert(float, 2), 31), -- signed int !
       @daterange = DateDiff(day, '1 jan 1801', '28 dec 1999')

UPDATE Treatments
   SET @multiplier  = (@max_int - Convert(real, ABS(BINARY_CHECKSUM(NewID())))) / @max_int, -- returns something between 0 and 1
       StartDate    = DateAdd(day, Convert(int, (@daterange * @multiplier)), '1 jan 1801') -- returns somewhere in the daterange

-- test 'spread'
SELECT COUNT(*), COUNT(DISTINCT StartDate), Min(StartDate), Max(StartDate) FROM Treatments

If anyone wants to test this you can use this to generate some test-data (@Kulingar: make sure not to drop your table by accident =)

IF DB_ID('test') IS NULL CREATE DATABASE test
GO
USE test
GO
IF Object_ID('test..Treatments') IS NOT NULL DROP TABLE test..Treatments
GO
SELECT row_id = IDENTITY(int, 1, 1), StartDate = CURRENT_TIMESTAMP INTO Treatments FROM sys.columns, sys.objects
deroby
  • 5,902
  • 2
  • 19
  • 33
0

i'd accomplish this writing a small program which would:

  1. select * and put in a structure
  2. change data in the structure
  3. replace old data with new data (drop table+create table or truncate+insert into might apply)

this way you perform your logic outside database and limit crud to the bare essential.

Alex
  • 23,004
  • 4
  • 39
  • 73