For research purposes (e.g. outsourcing to a 3rd-party data scientist community group), I need to export the production DB, masking certain sensitive fields (e.g. customer name, phone number, address, etc.).
Since the order_requests table in the production DB has around 500 million rows, I'd like to mask the field nickname into nickname_transformed. How do I do it fast?
order_requests table structure:
┌──────────────┬─────────────────────────────┬────────────────────────────────────────────┐
│    Column    │            Type             │                 Modifiers                  │
├──────────────┼─────────────────────────────┼────────────────────────────────────────────┤
│ id           │ integer                     │ not null default                           │
│              │                             │ nextval('order_requests_id_seq'::regclass) │
│ vehicle_cd   │ integer                     │                                            │
│ nickname     │ character varying(255)      │                                            │
│ phone_number │ character varying(255)      │                                            │
│ pickup_time  │ timestamp without time zone │                                            │
... 20+ fields more ...
└──────────────┴─────────────────────────────┴────────────────────────────────────────────┘
Indexes:
"order_requests_pkey" PRIMARY KEY, btree (id)
... 15+ indexes more ...
Foreign-key constraints:
... 7 foreign keys ...
Referenced by:
... 25+ references more ...
My current implementation uses dblink (it takes 6 hours to complete, while CPU usage on the DB stays under 10%; it is a standalone DB used only by me):
CREATE EXTENSION dblink;

ALTER TABLE order_requests ADD nickname_transformed VARCHAR;
ALTER TABLE order_requests DISABLE TRIGGER USER;

CREATE OR REPLACE FUNCTION f_update_in_steps()
  RETURNS void AS
$func$
DECLARE
   _step int;  -- size of step
   _cur  int;  -- current ID (starting with minimum)
   _max  int;  -- maximum ID
BEGIN
   SELECT INTO _cur, _max  min(id), max(id) FROM order_requests;

   -- 1000 slices (steps) hard coded
   _step := ((_max - _cur) / 1000) + 1;  -- rounded, possibly a bit too small
                                         -- +1 to avoid endless loop for 0
   PERFORM dblink_connect('postgres://username:password@localhost:5432/dbname');  -- loopback connection to the same DB

   FOR i IN 0..2000 LOOP  -- 2000 >> 1000 to make sure we exceed _max
      PERFORM dblink_exec(
         $$UPDATE order_requests
           SET    nickname_transformed = md5(nickname)
           WHERE  id >= $$ || _cur || $$
           AND    id <  $$ || _cur + _step || $$
           AND    true$$);  -- no-op predicate; placeholder for an optional filter (e.g. to skip already-masked rows)

      _cur := _cur + _step;
      EXIT WHEN _cur > _max;  -- stop when done (never loop till 2000)
   END LOOP;

   PERFORM dblink_disconnect();
END
$func$ LANGUAGE plpgsql;
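I then run it with a single call (the ENABLE line afterwards is just my counterpart to the DISABLE above):

SELECT f_update_in_steps();
ALTER TABLE order_requests ENABLE TRIGGER USER;

Each dblink_exec commits on the remote side as soon as it finishes, so the 500M-row update is broken into roughly 1000 separately committed slices instead of one huge transaction.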
Some questions on my mind as well:
- Would it be faster if I created another table cloning only the id and nickname fields plus an empty nickname_transformed field, did my operation there, and then copied nickname_transformed back? (A sketch of what I mean follows this list.)
- This article mentioned a more complex way to achieve the result. Is there any solid code example for this? Something like how to manage dblink connections, link them to their respective operations, NOTIFY, etc. (See the async sketch at the end of the question.)
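A sketch of the clone-table idea from the first question, assuming a scratch table named order_requests_mask (the name is just a placeholder):

-- clone only the columns involved, into a narrow table with no indexes or triggers
CREATE TABLE order_requests_mask AS
SELECT id, nickname, NULL::varchar AS nickname_transformed
FROM   order_requests;

-- do the operation on the narrow table
UPDATE order_requests_mask
SET    nickname_transformed = md5(nickname);

-- copy the result back into the wide table
UPDATE order_requests o
SET    nickname_transformed = m.nickname_transformed
FROM   order_requests_mask m
WHERE  m.id = o.id;

The copy-back step still rewrites every row of the wide table, which is the part I'm unsure actually saves time.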
I have already read these:
- What is the fastest way to apply 150M updates on PostgreSQL table
- How do I do large non-blocking updates in PostgreSQL?
A pure function / query approach is preferred, since I have to redo this job every 1-2 months.
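For the second question, this is roughly what I imagine the async variant would look like: a pool of named dblink connections fed slices with dblink_send_query and drained with dblink_is_busy / dblink_get_result. This is only a sketch under my own assumptions: the pool size of 2 and the names c1/c2 are placeholders, I poll with pg_sleep instead of using NOTIFY for simplicity, and I haven't verified the result-draining details against a real run:

CREATE OR REPLACE FUNCTION f_update_in_steps_async()
  RETURNS void AS
$func$
DECLARE
   _conns   text[] := '{c1,c2}';  -- pool of named connections; size is a placeholder
   _pending bool[] := '{f,f}';    -- does connection i have an uncollected result?
   _i    int := 0;
   _step int;
   _cur  int;
   _max  int;
BEGIN
   SELECT INTO _cur, _max  min(id), max(id) FROM order_requests;
   _step := ((_max - _cur) / 1000) + 1;

   PERFORM dblink_connect(c, 'postgres://username:password@localhost:5432/dbname')
   FROM   unnest(_conns) AS c;

   WHILE _cur <= _max LOOP
      _i := (_i % array_length(_conns, 1)) + 1;  -- round-robin over the pool

      IF _pending[_i] THEN  -- reusing a connection: collect its previous result first
         WHILE dblink_is_busy(_conns[_i]) = 1 LOOP
            PERFORM pg_sleep(0.1);
         END LOOP;
         PERFORM 1 FROM dblink_get_result(_conns[_i]) AS t(res text);  -- command status
         PERFORM 1 FROM dblink_get_result(_conns[_i]) AS t(res text);  -- trailing empty set frees the connection
      END IF;

      -- hand the next slice to this connection (async, returns immediately)
      PERFORM dblink_send_query(_conns[_i],
         format('UPDATE order_requests
                 SET    nickname_transformed = md5(nickname)
                 WHERE  id >= %s AND id < %s', _cur, _cur + _step));
      _pending[_i] := true;
      _cur := _cur + _step;
   END LOOP;

   -- drain the last results and disconnect
   FOR _i IN 1..array_length(_conns, 1) LOOP
      IF _pending[_i] THEN
         WHILE dblink_is_busy(_conns[_i]) = 1 LOOP
            PERFORM pg_sleep(0.1);
         END LOOP;
         PERFORM 1 FROM dblink_get_result(_conns[_i]) AS t(res text);
         PERFORM 1 FROM dblink_get_result(_conns[_i]) AS t(res text);
      END IF;
      PERFORM dblink_disconnect(_conns[_i]);
   END LOOP;
END
$func$ LANGUAGE plpgsql;

Is something along these lines how the connections are supposed to be managed, and would more parallel connections actually help given the CPU is mostly idle?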