3

I'm working on a "data obfuscator" script. I want to update every record in a table with fake data. I have a table with locations and trying to select a random record from that table to update the records in my person table. So SQL similiar to this

UPDATE Person
SET    City = (SELECT TOP 1 City
               FROM   z.CityStateZip c2
               ORDER  BY Newid()) 

The problem with this is that it only ever selects 1 city, instead of selecting a random City for each person. I've also tried

(SELECT TOP 1 City FROM z.CityStateZip c2 ORDER BY NEWID()), PersonId, from Person

But it still only selects 1 city (I thought sub queries ran once for every record) instead of what I desire -- a random city for each record.

I've also tried doing a join with the same results, only 1 city ever selected....

SELECT t.City,
       PersonId
FROM   Person
       INNER JOIN (SELECT TOP 1 City
                   FROM   z.CityStateZip c2
                   ORDER  BY Newid()) t
               ON 1 = 1 

I tried to shove this statement into a function but SQL Server doesn't allow me to use NEWID() inside a function.

Answer

I modified Giorgi's answer and the answer to the linked question and came up with this, it is very fast! I changed how I was randomly selecting a city. The order by NewId() was an issue. So Person has 5k records and CityStateZip has ~30K, I took it down from 40 seconds to 4...(now even faster without the count subquery)

DECLARE @count bigint 
SELECT @count = count(*) from z.CityStateZip

UPDATE p
SET p.City= b.City
FROM Person p
CROSS APPLY (SELECT TOP 1 City -- if Id is unique, top 1 shouldn't be necessary
             FROM z.CityStateZip 
             WHERE p.SomeKey = p.SomeKey and -- ... the magic! ↓↓↓
             Id = (Select ABS(Checksum(NewID()) % @count))) b
CSharper
  • 5,420
  • 6
  • 28
  • 54

1 Answers1

1

You should force the database engine to evaluate new value for each row. You can do this by adding dummy where clause on outer table like:

DECLARE @city TABLE(ID INT IDENTITY(1, 1), City VARCHAR(100))

INSERT INTO @city VALUES
('Dallas'),
('New York'),
('Washington'),
('Las Vegas')


DECLARE @random TABLE(ID INT IDENTITY(1, 1), City VARCHAR(100))

INSERT INTO @random VALUES
('Manchester'),
('London'),
('Oxford'),
('Liverpool')


SELECT * FROM @city c
CROSS APPLY(SELECT TOP 1 * FROM @random r WHERE c.ID = c.ID ORDER BY NEWID()) ca

if you remove WHERE c.ID = c.ID you will get same value for all rows.

Giorgi Nakeuri
  • 35,155
  • 8
  • 47
  • 75
  • Thanks Giorgi, I was able to get this to work, along with the answer from the dup question. Is there a more efficient way of doing this? The cross apply and NEWID() is pretty slow. I guess I could use TableSample but that doesn't give me truly random results. – CSharper May 27 '15 at 15:03
  • check out my edit, I was able to increase speed by a couple orders of magnitude – CSharper May 27 '15 at 16:43