A very simple and effective method to get such a sloped probability is to square random() (or take random()^3 for an even stronger effect).
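As a quick illustration (a throwaway check, not part of the solution), bucketing a large sample of random()^2 into ten equal ranges shows how strongly the low end is favored:

SELECT width_bucket(random()^2, 0, 1, 10) AS bucket
     , count(*) AS hits
FROM   generate_series(1, 100000)
GROUP  BY 1
ORDER  BY 1;   -- bucket 1 collects roughly a third of all hits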
Building on this premise, this function would yield a "perfect result":
CREATE OR REPLACE FUNCTION f_del()
  RETURNS void AS
$func$
DECLARE
   _del_pct CONSTANT real := 0.8;  -- percentage to delete
   _min        int;   -- minimum i in table
   _span       int;   -- difference to maximum i
   _ct         int;   -- helper variable
   _del_target int;   -- number of rows to be deleted
BEGIN
   SELECT INTO _min, _span, _del_target
          min(i), max(i) - min(i), (count(*) * _del_pct)::int
   FROM   tbl;

   LOOP
      DELETE FROM tbl t
      USING (
         SELECT DISTINCT i
         FROM  (
            SELECT DISTINCT _min + (_span * random()^2)::int AS i  -- square it
            FROM   generate_series(1, _del_target * 3)             -- good estimate for 80%
            ) num                    -- generates approx. more than enough numbers
         JOIN   tbl USING (i)
         LIMIT  _del_target          -- prohibit excess dynamically
         ) x
      WHERE  t.i = x.i;

      GET DIAGNOSTICS _ct = ROW_COUNT;
      _del_target := _del_target - _ct;

      EXIT WHEN _del_target <= 0;
   END LOOP;
END
$func$ LANGUAGE plpgsql;
Call:
SELECT f_del();
-> SQLfiddle
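A minimal way to try it out (assuming a test table tbl with a single integer column i, which the answer presupposes but does not spell out):

CREATE TABLE tbl (i int PRIMARY KEY);
INSERT INTO tbl SELECT g FROM generate_series(1, 1000000) g;

SELECT count(*) FROM tbl;   -- 1000000
SELECT f_del();             -- deletes ~ 80 %, skewed toward small i
SELECT count(*) FROM tbl;   -- ~ 200000 left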
This should work perfectly:

- with or without gaps in the number space (adapted _del_target to use count() instead of _span, so this works, too; see the sketch right after this list)
- with any minimum and maximum number
- with any number of rows
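For instance (a made-up test case, reusing the table name from above), an id range that neither starts at 1 nor is gap-free is handled just the same, because _min, _span and the count()-based _del_target all adapt automatically:

DROP TABLE IF EXISTS tbl;
CREATE TABLE tbl (i int PRIMARY KEY);
INSERT INTO tbl
SELECT g
FROM   generate_series(500001, 1500000) g
WHERE  g % 7 <> 0;   -- offset id range with gaps

SELECT f_del();      -- still removes ~ 80 % of the existing rows, skewed toward small i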
The line JOIN tbl USING (i) is only really useful if you have a lot of gaps or a bad initial estimate for generate_series(). It can be removed for the case at hand for more speed (and still exact results). If you pick the initial limit for generate_series() carefully, the function will not loop at all.
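As a rough, one-off way to judge whether the 3x over-provisioning already suffices for a single pass (numbers matching the 1,000,000-row / 80 % example below), you could count how many distinct target values that many draws actually produce:

SELECT count(DISTINCT (1000000 * random()^2)::int) AS distinct_hits
FROM   generate_series(1, 800000 * 3);   -- 3 x the delete target of 800000
-- compare distinct_hits to the delete target; if it stays above, one pass is enough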
I think it's safe to assume that I need not tell you how to generalize this further to work with dynamic table names or percentages.
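In case it is useful anyway, here is one possible sketch (not part of the original answer; the JOIN against the target table is omitted for brevity) that takes the table name and percentage as parameters, using dynamic SQL with format() and EXECUTE ... USING:

CREATE OR REPLACE FUNCTION f_del_any(_tbl regclass, _del_pct real)
  RETURNS void AS
$func$
DECLARE
   _min        int;
   _span       int;
   _ct         int;
   _del_target int;
BEGIN
   EXECUTE format('SELECT min(i), max(i) - min(i), (count(*) * $1)::int FROM %s', _tbl)
   INTO   _min, _span, _del_target
   USING  _del_pct;

   LOOP
      EXECUTE format(
         'DELETE FROM %1$s t
          USING (
             SELECT DISTINCT $1 + ($2 * random()^2)::int AS i
             FROM   generate_series(1, $3 * 3)
             LIMIT  $3
             ) x
          WHERE  t.i = x.i', _tbl)
      USING  _min, _span, _del_target;

      GET DIAGNOSTICS _ct = ROW_COUNT;
      _del_target := _del_target - _ct;
      EXIT WHEN _del_target <= 0;
   END LOOP;
END
$func$ LANGUAGE plpgsql;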
It's somewhat similar to this answer:
Best way to select random rows PostgreSQL
For just this case, this simple SQL command works a bit faster:
DELETE FROM tbl t
USING (
   SELECT DISTINCT (1000000 * random()^2)::int AS i
   FROM   generate_series(1, 2130000)
   ) x
WHERE  t.i = x.i;
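Optionally, to confirm the slope afterwards, a simple aggregation over blocks of i should show far fewer surviving rows in the low blocks:

SELECT i / 100000 AS block     -- integer division: blocks of 100000 ids
     , count(*)  AS remaining
FROM   tbl
GROUP  BY 1
ORDER  BY 1;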