
I have a simple table:

create table test (i int4 primary key);

which holds a million rows, with i >= 1 and i <= 1000000.
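For reference, the table can be populated along these lines (a sketch matching the setup above):

-- fill the table with 1,000,000 sequential values
INSERT INTO test (i)
SELECT gs FROM generate_series(1, 1000000) gs;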

I want to remove ~80% of the rows - so, something like delete from test where random() < 0.8 - but I want the delete to have a higher chance of removing rows with lower i values.

Technically, delete from test where i < 800000 does it, but I want the deleted rows to be random, and I still want some of the "high-pkey" rows to be removed and some (just far fewer) of the "low-pkey" rows to be kept.

Any idea how to achieve this?

4 Answers


Something like this?

create table ztest (val int4 primary key);

INSERT INTO ztest (val) SELECT gs FROM generate_series(1,1000) gs;

-- stratified delete: low values get a low deletion probability,
-- high values a high one
DELETE FROM ztest
WHERE (val >   0 AND val <= 10   AND random() < 0.1)
   OR (val >  10 AND val <= 100  AND random() < 0.5)
   OR (val > 100 AND val <= 1000 AND random() < 0.9);

SELECT * FROM ztest;

UPDATE: (but hard to tune ...)

DELETE FROM ztest
WHERE log(3 + val) * random() < .5;

[ the +3 is a very crude way to avoid log(1) = 0, which would mean the row with val = 1 always gets deleted ]
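To see how steep the curve actually turns out, a quick (hypothetical) check is to bucket the surviving values and count them:

-- count survivors per bucket of 100 values; the counts should rise
-- with val if lower values are being deleted more often
SELECT val / 100 AS bucket, count(*) AS remaining
FROM   ztest
GROUP  BY 1
ORDER  BY 1;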

wildplasser
  • I was thinking of something like this, but trying to come up with a way that didn't need conditional logic. Also, this gives the greatest chance of being selected to the larger numbers - so it's picking rows to keep, not choosing rows to delete. – Hart CO Jun 19 '13 at 16:13
  • I would also recommend creating a temp table with the keys you intend to delete first, to allow inspection - check that the count is close enough to the number you want to delete and that the random spread is to your liking. – Chris Jun 19 '13 at 16:15
  • It would be nice to have an unstratified version without the ugly AND/OR list, and an expected overall pass-through of 80%. Still trying ... – wildplasser Jun 19 '13 at 16:17
  • After running it as provided, it removed ~10% of rows (the log(3+val) approach). The one with the complicated WHERE - sure, it would work, but I need something less tedious to write. – Jun 19 '13 at 16:33
  • Sorry, I tested with N=1000, deleting approximately 200 rows. You might need to tweak it a little. To me it seems a bit steep, though (too many low values are hit, too few high ones; maybe it needs an extra term or coefficient somewhere ...). BTW: I tend to do this kind of stuff in multiple runs: first delete 20% from all, then delete 20% from the lower 100K, 20% from the lower 10K, etc. (sketched below). – wildplasser Jun 19 '13 at 16:37
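Against the question's test table, the multi-pass idea from the last comment could look something like this (a rough sketch - the thresholds and percentages are illustrative, not tuned to remove exactly 80% overall):

-- each pass deletes ~20% of a progressively lower range,
-- so low keys accumulate a higher total deletion probability
DELETE FROM test WHERE random() < 0.2;
DELETE FROM test WHERE i <= 100000 AND random() < 0.2;
DELETE FROM test WHERE i <= 10000  AND random() < 0.2;
DELETE FROM test WHERE i <= 1000   AND random() < 0.2;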

So you need to assign a weight to i. Since you know you have 1,000,000 rows, this should be easy.

delete from test where random() < .8 + ((500000 - i)::float / 10000000);

In the example above, the lowest value of i has a ~85% chance of being deleted, whereas the highest has a ~75% chance. (The cast to float matters - with plain integer division the adjustment term would always be 0.) Of course this won't produce exactly 80%, but you only wanted an approximation. You can adjust the denominator to fit your purposes, and of course come up with a more advanced weighting scheme.
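A quick sanity check of that probability expression at the extremes (a hypothetical check query using the same formula):

-- deletion probability at the lowest, middle and highest key
SELECT i, .8 + ((500000 - i)::float / 10000000) AS p_delete
FROM   (VALUES (1), (500000), (1000000)) AS v(i);
-- 1        -> ~0.85
-- 500000   ->  0.80
-- 1000000  ->  0.75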

Daniel Gimenez
  • It removed almost all of the highest id rows, and in the rest of the table the distribution is more or less linear. It can be tried with: `create table test as select generate_series(1,100000) as id; delete from test where random() < .8 + ((50000 - id) / 1000000); select id/5000, count(*) from test group by 1 order by 1;` The idea is that in the last select, the count should get higher for each row. – Jun 19 '13 at 16:31
  • Yeah, my test resulted in a similarly linear distribution. – Hart CO Jun 19 '13 at 16:35

With evenly distributed data, starting at 1, this works:

delete from test where random() + 0.1 * (500000 - i) / 500000 > 0.2;

This should have about a 90% chance to remove the lowest ID, and a 70% chance to remove the highest.

If your data is not evenly distributed, you can accomplish the same thing by using rank() over (order by i) in place of i, but this would be much slower.
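Since a window function can't be used directly in a DELETE's WHERE clause, that rank() variant would need a subquery - roughly along these lines (a sketch, assuming the 1,000,000-row table from the question so that 500000 remains the midpoint of the rank range):

-- compute the rank first, then delete by joining back on the key
DELETE FROM test t
USING (
   SELECT i, rank() OVER (ORDER BY i) AS rnk
   FROM   test
   ) r
WHERE  t.i = r.i
AND    random() + 0.1 * (500000 - r.rnk) / 500000 > 0.2;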

Corey

A very simple and effective method to get such a sloped probability is to square random() (or take random()^3 for an even stronger effect ..).
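To see the skew, here is a quick (hypothetical) illustration - bucketing 100,000 draws of random()^2 shows them piling up near 0:

-- roughly a third of the squared values land in the lowest of 10 buckets
SELECT width_bucket(random()^2, 0, 1, 10) AS bucket, count(*)
FROM   generate_series(1, 100000)
GROUP  BY 1
ORDER  BY 1;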

Building on this premise, this function would yield a "perfect result":

CREATE OR REPLACE FUNCTION f_del()
  RETURNS void AS
$func$
DECLARE
   _del_pct CONSTANT real := 0.8;  -- percentage to delete
   _min        int;                -- minimum i in table
   _span       int;                -- diff. to maximum i
   _ct         int;                -- helper var.
   _del_target int;                -- number rows to be deleted
BEGIN

SELECT INTO _min, _span, _del_target
             min(i), max(i) - min(i), (count(*) * _del_pct)::int FROM tbl;

LOOP
   DELETE FROM tbl t
   USING (
      SELECT DISTINCT i
      FROM (
         SELECT DISTINCT _min + (_span * random()^2)::int AS i -- square it
         FROM   generate_series (1, _del_target * 3)  -- good estimate for 80%
         ) num                    -- generate approx. more than enough numbers
      JOIN   tbl USING (i)
      LIMIT  _del_target          -- prohibit excess dynamically
      ) x
   WHERE t.i = x.i;

   GET DIAGNOSTICS _ct = ROW_COUNT;
   _del_target := _del_target - _ct;

   EXIT WHEN _del_target <= 0;
END LOOP;

END $func$ LANGUAGE plpgsql;

Call:

SELECT f_del();

-> SQLfiddle

This should work perfectly:

  • with or without gaps in the number space
    (Adapted _del_target to use count() instead of _span, so this works, too.)
  • with any minimum and maximum number
  • with any number of rows

The line

JOIN   tbl USING (i)

.. is only really useful if you have a lot of gaps or a bad initial estimate for generate_series(). It can be removed for the case at hand for more speed (and still give exact results).

If you pick the initial limit for generate_series() carefully, the function will not loop at all.

I think it's safe to assume that I need not tell you how to generalize this further to work with dynamic table names or percentages.
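For what it's worth, a minimal sketch of such a generalization (hypothetical function name f_del_any; the column name i is still hardcoded, and the loop / exact-count logic of the function above is left out for brevity):

CREATE OR REPLACE FUNCTION f_del_any(_tbl regclass, _del_pct real)
  RETURNS void AS
$f$
DECLARE
   _min        int;
   _span       int;
   _del_target int;
BEGIN
   -- gather minimum, span and target count from the given table
   EXECUTE format('SELECT min(i), max(i) - min(i), (count(*) * $1)::int FROM %s', _tbl)
   INTO  _min, _span, _del_target
   USING _del_pct;

   -- one skewed-random pass; no loop, so the result is only approximate
   EXECUTE format(
      'DELETE FROM %s t
       USING (
          SELECT DISTINCT $1 + ($2 * random()^2)::int AS i
          FROM   generate_series(1, $3 * 3)
          ) x
       WHERE  t.i = x.i', _tbl)
   USING _min, _span, _del_target;
END
$f$ LANGUAGE plpgsql;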

It's somewhat similar to this answer:
Best way to select random rows PostgreSQL


For just this case, a simple SQL command works a bit faster:

DELETE FROM tbl t
USING (
   SELECT DISTINCT (1000000 * random()^2)::int AS i
   FROM   generate_series (1, 2130000)
   ) x
WHERE t.i = x.i;
Erwin Brandstetter
  • Please check this on a 1 million row table, vs. Corey's code. And check times. – Jun 19 '13 at 18:12
  • @depesz: My initial estimate for `generate_series()` was too low, which caused many loops. With the adapted estimate it doesn't loop and finishes in **10 sec** for 1 million rows now on my old test server. Corey's code only takes 1.5 sec. But my method is more exact and versatile .. depending on what you need. – Erwin Brandstetter Jun 19 '13 at 18:23