I have a Postgres table with several billion rows, and for a machine learning application I need to divide it into train and test sets. I want the test rows to be mostly contiguous by their id column, so I would like to randomly select several chunks of 1,000 contiguous rows each and mark them as test rows. I have an index on the id column, so selecting any arbitrary 1,000 contiguous rows is fast:
UPDATE table SET test=true WHERE id BETWEEN 100000 AND 101000;
This is very efficient and uses an index scan, as you would expect. Unfortunately, as soon as I generate the initial id randomly, i.e.
WITH off AS (SELECT ROUND(random()*maxId))
UPDATE table SET test=true
WHERE id BETWEEN (SELECT * FROM off LIMIT 1)
AND (SELECT * FROM off LIMIT 1)+1000;
the query planner decides to do a full table scan instead (much slower).
Of course, if I only had to do this once, I would just generate a random row manually, no problem. However, in the end I want a function that automatically divides the table into test and train sets, like the one below:
CREATE OR REPLACE FUNCTION test_train_divide(chunkSize integer, proportion real)
RETURNS boolean
AS $$
DECLARE
    -- id is a bigint and the table has several billion rows, so bigint here
    maxId BIGINT := (SELECT MAX(id) FROM table);
BEGIN
    FOR i IN 1 .. round(maxId * proportion / chunkSize) LOOP
        RAISE NOTICE 'Update call %', i;
        WITH off AS (SELECT round(random() * maxId))
        UPDATE table SET test = true
        WHERE id BETWEEN (SELECT * FROM off LIMIT 1)
                     AND (SELECT * FROM off LIMIT 1) + chunkSize;
    END LOOP;
    RETURN true;
END;
$$ LANGUAGE plpgsql;
SELECT test_train_divide(1000,0.01);
This works, but is insanely slow! Any pointers?
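In the meantime, the one-off workaround mentioned above (generating the random row manually) can be scripted in psql by computing the offset first and splicing it in as a variable, so the query itself only ever sees a constant. A sketch, assuming the real table/index names from the schema below:

```sql
-- Compute a random start id once, client-side, and store it in a
-- psql variable via \gset; the cast to bigint keeps it an integer type.
SELECT round(random() * max(id))::bigint AS off FROM tbl \gset

-- The planner now sees plain bigint constants and uses the index,
-- just like the hard-coded BETWEEN query above.
UPDATE tbl SET test = true WHERE id BETWEEN :off AND :off + 1000;
```

This is obviously not a substitute for a proper plpgsql function, but it shows the behaviour I am after.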
Update
Here is the schema:
Table "public.tbl"
  Column |  Type   | Modifiers
---------+---------+-----------
 subid   | integer |
 id      | bigint  |
 wordid  | integer |
 capid   | integer |
 test    | boolean |
Indexes:
    "tbl_id_idx" btree (id)
And here are two different query plans, one good (using the index) and one bad:
will=# EXPLAIN UPDATE tbl SET test=true WHERE id BETWEEN 1000000 AND 1001000;
QUERY PLAN
---------------------------------------------------------------------------------------------------
 Update on tbl  (cost=0.57..790.45 rows=1079 width=38)
   ->  Index Scan using tbl_id_idx on tbl  (cost=0.57..790.45 rows=1079 width=38)
         Index Cond: ((id >= 1000000) AND (id <= 1001000))
(3 rows)
will=# EXPLAIN WITH start AS (SELECT round(random()*max(id)) FROM tbl) UPDATE tbl c SET test=true WHERE c.id BETWEEN (SELECT * FROM start LIMIT 1) AND (SELECT * FROM start LIMIT 1)+1000;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Update on tbl c  (cost=0.65..14932243.97 rows=1459961 width=38)
   CTE start
     ->  Result  (cost=0.59..0.61 rows=1 width=0)
           InitPlan 1 (returns $0)
             ->  Limit  (cost=0.57..0.59 rows=1 width=8)
                   ->  Index Only Scan Backward using tbl_id_idx on tbl  (cost=0.57..5846291.90 rows=288468819 width=8)
                         Index Cond: (id IS NOT NULL)
   InitPlan 3 (returns $2)
     ->  Limit  (cost=0.00..0.02 rows=1 width=8)
           ->  CTE Scan on start  (cost=0.00..0.02 rows=1 width=8)
   InitPlan 4 (returns $3)
     ->  Limit  (cost=0.00..0.02 rows=1 width=8)
           ->  CTE Scan on start start_1  (cost=0.00..0.02 rows=1 width=8)
   ->  Seq Scan on tbl c  (cost=0.00..14932243.32 rows=1459961 width=38)
         Filter: (((id)::double precision >= $2) AND ((id)::double precision <= ($3 + 1000::double precision)))
(15 rows)
Time: 2.649 ms
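One thing I notice in the bad plan's filter: id is being compared as ((id)::double precision >= $2), presumably because random() returns double precision, so the bigint index can't be used directly. Perhaps casting the computed bounds back to bigint inside the query would restore the index scan; an untested sketch:

```sql
-- Untested: cast the random offset to bigint so the BETWEEN comparison
-- stays in bigint and can (hopefully) use tbl_id_idx.
WITH start AS (SELECT round(random() * max(id))::bigint AS s FROM tbl)
UPDATE tbl SET test = true
WHERE id BETWEEN (SELECT s FROM start)
             AND (SELECT s FROM start) + 1000;
```

Is that the right diagnosis, or is something else keeping the planner off the index?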