
In a table with > 100k rows, how can I efficiently shuffle the values of a specific column?

Table definition:

CREATE TABLE person
(
  id integer NOT NULL,
  first_name character varying,
  last_name character varying,
 CONSTRAINT person_pkey PRIMARY KEY (id)
)

In order to anonymize data, I have to shuffle the values of the 'first_name' column in place (I'm not allowed to create a new table).

My attempt:

with
first_names as (
select row_number() over (order by random()),
       first_name as new_first_name
from person
),
ids as (
select row_number() over (order by random()), 
       id as ref_id
from person
)
update person
set first_name = new_first_name
from first_names, ids
where id = ref_id;

It takes hours to complete.

Is there an efficient way to do it?

Serge
    Your two CTEs don't really randomize the data. You just generate random "row numbers" but then you don't use them. If you leave out the `row_number()` calls it wouldn't be any different. You are also doing a cross join between the two CTEs and the `person` table which means your statement generates an intermediate result of 100.000 * 100.000 * 100.000 rows. –  Nov 05 '15 at 22:00
  • http://stackoverflow.com/a/3100232/3574819 might help make updates faster – FuzzyTree Nov 06 '15 at 00:18

2 Answers


This one takes 5 seconds to shuffle 500.000 rows on my laptop:

with names as (
  select id, first_name, last_name,
         lead(first_name) over w as first_1,
         lag(first_name) over w as first_2
  from person
  window w as (order by random())
)
update person
  set first_name = coalesce(first_1, first_2)
from names 
where person.id = names.id;

The idea is to pick the "next" name after sorting the data randomly, which is just as good as picking a random name.

There is a chance that not all names are shuffled, but if you run it two or three times, this should be good enough.
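
(Not part of the original answer, just a sketch building on the same idea: if every row should change in a single pass and each existing name should be used exactly once, the random order can be rotated by one position, wrapping around to the first value for the last row. It assumes first_name has no NULLs; a row can still keep its value if the data contains duplicate names.)

with shuffled as (
  select id,
         coalesce(lead(first_name) over w,          -- the "next" name in random order
                  first_value(first_name) over w)   -- wrap around for the last row
           as new_first_name
  from person
  window w as (order by random())
)
update person
  set first_name = shuffled.new_first_name
from shuffled
where person.id = shuffled.id;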

Here is a test setup on SQLFiddle: http://sqlfiddle.com/#!15/15713/1

The query on the right hand side checks if any first name stayed the same after the "randomizing".
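
(The fiddle itself is not reproduced here; a sketch of such a check, assuming the original values were first copied into a helper table, here called person_before, could look like this:)

create table person_before as
  select id, first_name from person;

-- ... run the shuffle ...

select count(*) as unchanged
from person p
join person_before b on b.id = p.id
where p.first_name is not distinct from b.first_name;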

  • Unfortunately, the first method might choose the same value multiple times and skip over others. I'm not sure if that is what the OP intends. – Gordon Linoff Nov 05 '15 at 22:16
  • Can you give me a tip on how you created a 500k sample table? – Juan Carlos Oropeza Nov 05 '15 at 22:16
  • @GordonLinoff: I tried this several times with names that contained the original id and only once had a run where one name wasn't changed. I think the bigger the table the more likely this is going to work. The one with the "random offset" doesn't actually work that well with a small number of rows. –  Nov 05 '15 at 22:19
  • @a_horse_with_no_name . . . It is not that a name is not changed. It is that this finds a random name each time. So, some names in the original data could appear 2, 3, or 4 times. And others might never be chosen. I have the impression (although the OP is definitely not clear about this) that the intention is permutation of the original names (each one used exactly once) rather than independent random samples. – Gordon Linoff Nov 05 '15 at 22:22
  • wouldn't `lead` return null for the 1st value of the random list? which means that one of the names would get lost – FuzzyTree Nov 05 '15 at 22:23
  • @a_horse_with_no_name . . . I changed my mind. The OP says "shuffle", but the business problem is simply to anonymize the list. This approach is fine. You can use `coalesce()` if you are worried about `NULL` values. – Gordon Linoff Nov 05 '15 at 22:29
  • horse, the first query took 120 sec on my PC, how does yours take 5 sec? I know my hard drive isn't fast, but I have an i7 and 8 GB RAM :( What should I check? – Juan Carlos Oropeza Nov 05 '15 at 22:30
  • @FuzzyTree: I think the problem with the null can be solved by generating two values, one with lead and one with lag. One of them must be not-null –  Nov 05 '15 at 22:32
  • @a_horse_with_no_name i think the problem with using `lead` and `lag` is that the separate `lead` and `lag` calls don't have the same random order. So you will very likely have 1 duplicate value. You could solve the issue with union and an order column to force the 1st row to be a throwaway. Or another way is to select random() in a derived table to make sure the same order is used for each function. – FuzzyTree Nov 05 '15 at 22:33
  • @JuanCarlosOropeza: it's probably your harddisk. Here is my explain plan: http://explain.depesz.com/s/5Ilr you can probably improve the speed by increasing work_mem to avoid costly sorting on disk –  Nov 05 '15 at 22:33
  • @FuzzyTree: I think that can be solved by using the same window definition for both functions. –  Nov 05 '15 at 22:36
  • @a_horse_with_no_name I still think this query will duplicate 1 value because coalesce(lead,lag) doesn't wrap around for the 1st value. For example, the 1st element of the random list will not have a lead, so what we want is to select the last element of the random list - not the lag of the 1st value, which would be the 2nd element of the random list (duplicated by the lead of the 3rd value). – FuzzyTree Nov 05 '15 at 22:59
  • @a_horse_with_no_name In your fiddle you use 50k, but you say your laptop runs 500k in 5 sec? I wonder if you meant 50k on your laptop too. – Juan Carlos Oropeza Nov 11 '15 at 16:54
  • @JuanCarlosOropeza: no, look at the explain output I had in my comment: "rows=500000" –  Nov 11 '15 at 18:04
  • Then I don't know why there is so much difference. I have tried several combinations: 3 different servers (Unix, Windows 7, Windows Server 2008), Postgres 9.2, 9.3, 9.4, normal hard disk, SCSI, SSD, and all take more than 40 sec, the worst being 100 sec. And changing work_mem doesn't help. – Juan Carlos Oropeza Nov 11 '15 at 18:08
  • @a_horse_with_no_name How do you create your explain plan? I use F7 but mine doesn't show times http://explain.depesz.com/s/8XC – Juan Carlos Oropeza Nov 11 '15 at 20:18
  • @JuanCarlosOropeza `explain (analyze, verbose) select ...` –  Nov 11 '15 at 20:25
  • @a_horse_with_no_name I will create a question on dba.stackexchange, but I want to make sure I can compare both after the multiple updates. I used your fiddle to create the db and executed the fiddle query; the plans are similar. [**Mine**](http://explain.depesz.com/s/Zlgv) doesn't use buffers and starts with 345k rows [**Yours**](http://explain.depesz.com/s/Zlgv) – Juan Carlos Oropeza Nov 11 '15 at 21:18
  • @a_horse_with_no_name your SQLFiddle sample does not match the SQL in the answer here... – deed02392 Oct 23 '20 at 09:36
  • Use of coalesce loses null values (unless both values are null), which might not always be an issue depending on your use case. – Petri Ryhänen Feb 25 '21 at 09:37

The problem with Postgres is that every update means a delete + insert.

  • You can check the plan by running EXPLAIN ANALYZE on a SELECT instead of the UPDATE to see how the CTEs perform
  • You can drop the indexes so the updates are faster
  • But the best solution I use when I need to update all the rows is to recreate the table:


CREATE TABLE new_table AS
     SELECT * ... ;             -- the shuffled data

DROP TABLE old_table;

ALTER TABLE new_table RENAME TO old_table;

-- re-create the indexes and constraints

Sorry, that isn't an option for you :(

EDIT: After reading a_horse_with_no_name's comments, it looks like you need:

with
first_names as (
    select row_number() over (order by random()) rn,
           first_name as new_first_name
    from person
),
ids as (
    select row_number() over (order by random()) rn, 
           id as ref_id
    from person
)
update person
set first_name = new_first_name
from first_names
join ids
  on first_names.rn = ids.rn
where id = ref_id;

Again, for a performance question it is better if you provide the EXPLAIN (ANALYZE) output.
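
For example (only a sketch; note that EXPLAIN ANALYZE actually executes the UPDATE, so wrap it in a transaction and roll back if the changes should not be kept):

begin;

explain (analyze, verbose, buffers)
with
first_names as (
    select row_number() over (order by random()) rn,
           first_name as new_first_name
    from person
),
ids as (
    select row_number() over (order by random()) rn,
           id as ref_id
    from person
)
update person
set first_name = new_first_name
from first_names
join ids
  on first_names.rn = ids.rn
where id = ref_id;

rollback;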

Juan Carlos Oropeza
  • This is a correct formulation of what the OP intends. The performance is probably the same. – Gordon Linoff Nov 05 '15 at 22:26
  • This is actually surprisingly fast on my 500.000 rows test table: http://explain.depesz.com/s/Dwh –  Nov 05 '15 at 22:44
  • @a_horse_with_no_name Yes, on my PC. The OP's query I stopped after 10 min; your query ran in 150 sec, mine in 90 sec. So mine looks 40% faster. Now I have to find out why my PC is so slow. Finishing my dessert and then I'll start checking that mem thing. – Juan Carlos Oropeza Nov 05 '15 at 22:49
  • @Juan Carlos Oropeza Thanks a lot! The query you posted in your EDIT runs in 6s on my table! – Serge Nov 06 '15 at 07:10
  • Good. I still don't know why my PC needs 100 sec; I tried work_mem but it made it worse. Just curious, how long does a_horse's query take on your table? – Juan Carlos Oropeza Nov 06 '15 at 11:51