For research purposes (e.g. outsourcing to a 3rd-party data scientist community group), I need to export the production DB, masking certain sensitive fields (e.g. customer name, phone number, address, etc.).

Now, since this order_requests table in the production DB has around 500 million rows, I'd like to mask the field nickname into nickname_transformed. How do I do it fast?

order_requests table structure:

┌──────────────────┬─────────────────────────────┬────────────────────┐
│      Column      │            Type             │     Modifiers      │
├──────────────────┼─────────────────────────────┼────────────────────┤
│ id               │ integer                     │ not null default   │
│                  │        nextval('order_requests_id_seq'::regclass)│
│ vehicle_cd       │ integer                     │                    │
│ nickname         │ character varying(255)      │                    │
│ phone_number     │ character varying(255)      │                    │
│ pickup_time      │ timestamp without time zone │                    │

... 20+ fields more ...

└──────────────────┴─────────────────────────────┴────────────────────┘
Indexes:
    "order_requests_pkey" PRIMARY KEY, btree (id)
    ... 15+ indexes more ...
Foreign-key constraints:
    ... 7 foreign keys ...
Referenced by:
    ... 25+ references more ...

My current implementation uses dblink (6 hours to complete, while CPU on the DB stays below 10%; the standalone DB is used only by me):

CREATE EXTENSION dblink;
ALTER TABLE order_requests ADD nickname_transformed VARCHAR;
ALTER TABLE order_requests DISABLE TRIGGER USER;
CREATE OR REPLACE FUNCTION f_update_in_steps()
  RETURNS void AS
$func$
DECLARE
   _step int;   -- size of step
   _cur  int;   -- current ID (starting with minimum)
   _max  int;   -- maximum ID
BEGIN
   SELECT INTO _cur, _max  min(id), max(id) FROM order_requests;
                                         -- 1000 slices (steps) hard coded
   _step := ((_max - _cur) / 1000) + 1;  -- rounded, possibly a bit too small
                                         -- +1 to avoid endless loop for 0
   PERFORM dblink_connect('postgres://username:password@localhost:5432/dbname');  -- loopback connection, so each batch commits on its own

   FOR i IN 0..2000 LOOP                 -- 2000 >> 1000 to make sure we exceed _max
      PERFORM dblink_exec(
       $$UPDATE order_requests
         SET    nickname_transformed = md5(nickname)
         WHERE  id >= $$ || _cur || $$
         AND    id <  $$ || _cur + _step || $$
         AND    true$$);  -- avoid empty update
      _cur := _cur + _step;

      EXIT WHEN _cur > _max;            -- stop when done (never loops till 2000)
   END LOOP;

   PERFORM dblink_disconnect();
END
$func$  LANGUAGE plpgsql;

Some questions also on my mind:

  1. Will it be faster if I create another table with only cloned id and nickname fields plus an empty nickname_transformed field, do my operation there, then copy nickname_transformed back?
  2. This article mentioned a more complex way to achieve the result. Is there any solid code example for this? Something like how to manage dblink connections, link them to their respective operations, notify, etc.

I have read this:

A pure function / query is preferred, since I have to redo this job every 1-2 months.

Ming Hsieh

3 Answers

When I need to update all rows in such a large table I use the following procedure:

  • drop all indexes on the table, leaving only those required by foreign keys referencing the table;
  • perform the update - as it no longer needs to maintain the indexes, it's much faster;
  • vacuum full the table - to reclaim the space used up by old row versions;
  • recreate all the indexes - this can be done in parallel - I use xargs -P for this.
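The steps above can be sketched with psql; dbname, some_index_name, and indexes.sql are placeholders here, and the xargs invocation assumes GNU xargs:

```shell
# 1. Save the index definitions before dropping them.
psql -d dbname -Atc "SELECT indexdef || ';' FROM pg_indexes
                     WHERE tablename = 'order_requests'" > indexes.sql
psql -d dbname -c 'DROP INDEX some_index_name;'   # repeat per droppable index

# 2. The update no longer has to maintain those indexes.
psql -d dbname -c "UPDATE order_requests SET nickname_transformed = md5(nickname);"

# 3. Reclaim the space taken by the old row versions.
psql -d dbname -c 'VACUUM FULL order_requests;'

# 4. Recreate the indexes, four at a time.
xargs -P 4 -d '\n' -I{} psql -d dbname -c '{}' < indexes.sql
```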

md5(nickname) does not provide sufficient anonymity - it's trivially easy to recover the original nicknames with some brute force or a rainbow-table search - nicknames don't have enough entropy.

You should rather use substring(md5(nickname||'some-long-secret-string') for 8):

  • the secret would not be known to your contractor, which removes the ability to brute-force all the nicknames;
  • a truncated hash would be too short to reliably reverse the nicknames even if your secret leaked.
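For illustration only, the same salted-and-truncated scheme in Python (the secret here is a placeholder, not a suggested value):

```python
import hashlib

def mask_nickname(nickname: str, secret: str) -> str:
    # Mirrors substring(md5(nickname || secret) for 8) in SQL.
    return hashlib.md5((nickname + secret).encode("utf-8")).hexdigest()[:8]

# Same input and secret -> same token, so joins on the masked value still work;
# without the secret the 8-hex-digit token is not practically reversible.
token = mask_nickname("alice", "some-long-secret-string")
```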
Tometzky
  • this one covers some essentials but is not complete (e.g. making use of the low CPU load, the ample memory, and the single dedicated host). – Ming Hsieh Mar 17 '17 at 22:37

Why not:

update order_requests set nickname = id::text;

and forget about nickname_transformed?

If you must have a nickname_transformed that maintains the same obscured value for a given nickname, then:

create table nickname_hash (
    nickname varchar(255),
    hashed   char(32)
);
insert into nickname_hash (nickname)
select distinct nickname from order_requests;
update nickname_hash set hashed = md5(nickname);
create index idx1 on nickname_hash (nickname, hashed);
update order_requests o
set    nickname_transformed = h.hashed
from   nickname_hash h
where  o.nickname = h.nickname;
Bohemian

If you want to work with realistic data, you can replace the real data with fake data using a library like https://github.com/joke2k/faker/

That one is in Python, but similar libraries exist in many languages.

Gab