I have a Postgres table with several billion rows, and for a machine learning application I need to divide it into train and test sets. I want the test rows to be mostly contiguous by their id column, so I would like to randomly select several chunks of 1,000 contiguous rows each and mark them as test rows. I have an index on the id column, so selecting any arbitrary 1,000 contiguous rows is fast:
UPDATE table SET test=true WHERE id BETWEEN 100000 AND 101000;
This is very efficient and uses an index scan, as you would expect. Unfortunately, as soon as I generate the initial id randomly, i.e.
WITH off AS (SELECT ROUND(random()*maxId))
UPDATE table SET test=true
WHERE id BETWEEN (SELECT * FROM off LIMIT 1)
AND (SELECT * FROM off LIMIT 1)+1000;
the query planner decides to do a full table scan instead (much slower).
Of course, if I only had to do this once, I would just generate a random row manually, no problem. However, in the end I want a function that automatically divides the table into test and train sets, like the one below:
CREATE OR REPLACE FUNCTION test_train_divide(chunkSize integer, proportion real)
RETURNS boolean
AS $$
DECLARE
    -- id is a bigint and the table has several billion rows, so bigint here
    maxId BIGINT := (SELECT MAX(id) FROM table);
BEGIN
    FOR i IN 1 .. round(maxId * proportion / chunkSize) LOOP
        RAISE NOTICE 'Update call %', i;
        WITH off AS (SELECT round(random() * maxId))
        UPDATE table SET test = true
        WHERE id BETWEEN (SELECT * FROM off LIMIT 1)
                     AND (SELECT * FROM off LIMIT 1) + chunkSize;
    END LOOP;
    RETURN true;
END;
$$ LANGUAGE plpgsql;
SELECT test_train_divide(1000,0.01);
This works, but is insanely slow! Any pointers?
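In the meantime, the one-off workaround mentioned above (generating the random row manually) can be scripted in psql by computing the offset first and splicing it in as a variable, so the query itself only ever sees a constant. A sketch, assuming the real table/index names from the schema below:

```sql
-- Compute a random start id once, client-side, and store it in a
-- psql variable via \gset; the cast to bigint keeps it an integer type.
SELECT round(random() * max(id))::bigint AS off FROM tbl \gset

-- The planner now sees plain bigint constants and uses the index,
-- just like the hard-coded BETWEEN query above.
UPDATE tbl SET test = true WHERE id BETWEEN :off AND :off + 1000;
```

This is obviously not a substitute for a proper plpgsql function, but it shows the behaviour I am after.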
Update
Here is the schema:
Table "public.tbl"
  Column |  Type   | Modifiers
---------+---------+-----------
 subid   | integer |
 id      | bigint  |
 wordid  | integer |
 capid   | integer |
 test    | boolean |
Indexes:
    "tbl_id_idx" btree (id)
And here are two different query plans, one good (using the index) and one bad:
will=# EXPLAIN UPDATE tbl SET test=true WHERE id BETWEEN 1000000 AND 1001000;
QUERY PLAN
---------------------------------------------------------------------------------------------------
 Update on tbl  (cost=0.57..790.45 rows=1079 width=38)
   ->  Index Scan using tbl_id_idx on tbl  (cost=0.57..790.45 rows=1079 width=38)
         Index Cond: ((id >= 1000000) AND (id <= 1001000))
(3 rows)
will=# EXPLAIN WITH start AS (SELECT round(random()*max(id)) FROM tbl) UPDATE tbl c SET test=true WHERE c.id BETWEEN (SELECT * FROM start LIMIT 1) AND (SELECT * FROM start LIMIT 1)+1000;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Update on tbl c  (cost=0.65..14932243.97 rows=1459961 width=38)
   CTE start
     ->  Result  (cost=0.59..0.61 rows=1 width=0)
           InitPlan 1 (returns $0)
             ->  Limit  (cost=0.57..0.59 rows=1 width=8)
                   ->  Index Only Scan Backward using tbl_id_idx on tbl  (cost=0.57..5846291.90 rows=288468819 width=8)
                         Index Cond: (id IS NOT NULL)
   InitPlan 3 (returns $2)
     ->  Limit  (cost=0.00..0.02 rows=1 width=8)
           ->  CTE Scan on start  (cost=0.00..0.02 rows=1 width=8)
   InitPlan 4 (returns $3)
     ->  Limit  (cost=0.00..0.02 rows=1 width=8)
           ->  CTE Scan on start start_1  (cost=0.00..0.02 rows=1 width=8)
   ->  Seq Scan on tbl c  (cost=0.00..14932243.32 rows=1459961 width=38)
         Filter: (((id)::double precision >= $2) AND ((id)::double precision <= ($3 + 1000::double precision)))
(15 rows)
Time: 2.649 ms
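One thing I notice in the bad plan's filter: id is being compared as ((id)::double precision >= $2), presumably because random() returns double precision, so the bigint index can't be used directly. Perhaps casting the computed bounds back to bigint inside the query would restore the index scan; an untested sketch:

```sql
-- Untested: cast the random offset to bigint so the BETWEEN comparison
-- stays in bigint and can (hopefully) use tbl_id_idx.
WITH start AS (SELECT round(random() * max(id))::bigint AS s FROM tbl)
UPDATE tbl SET test = true
WHERE id BETWEEN (SELECT s FROM start)
             AND (SELECT s FROM start) + 1000;
```

Is that the right diagnosis, or is something else keeping the planner off the index?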