
I want to update a column of a Postgres table. There are around 55 million records, so I need to update them in batches of 10,000. Note: I want to update all rows, but I don't want to lock the table.

I am trying the query below:

UPDATE account SET name = 'Some name' WHERE id BETWEEN 1 AND 10000;

How can I loop over every 10,000 records to run the update?

Any suggestions and help will be appreciated.

PostgreSQL 10.5

Jim Jones
codegeek

4 Answers


Instead of committing all changes in a single transaction (or 55 million times, as suggested in another answer), I would rather split the updated rows into small batches, e.g. the 10k records you suggested. In PL/pgSQL it is possible to iterate over an integer range with a given step using the keyword BY, so you can do your batch update in an anonymous code block like this:

PostgreSQL 11+

DO $$ 
DECLARE 
  page int := 10000;
  min_id bigint; max_id bigint;
BEGIN
  SELECT max(id),min(id) INTO max_id,min_id FROM account;
  FOR j IN min_id..max_id BY page LOOP 
    UPDATE account SET name = 'your magic goes here'
    WHERE id >= j AND id < j+page;
    COMMIT;            
  END LOOP;
END; $$;
  • The half-open condition `id >= j AND id < j+page` keeps consecutive batches from overlapping; adjust the WHERE clause if your ids are laid out differently.
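As a standalone illustration (not part of the original answer), the batch boundaries produced by `FOR j IN min_id..max_id BY page` can be sketched in Python to confirm that stepping by `page` over half-open intervals tiles the id space with no overlaps and no gaps:

```python
# Reproduce the ranges generated by: FOR j IN min_id..max_id BY page
# Each batch covers the half-open interval [j, j + page).
def batch_ranges(min_id, max_id, page):
    return [(j, j + page) for j in range(min_id, max_id + 1, page)]

ranges = batch_ranges(0, 1050, 100)
print(ranges[0], ranges[-1])  # (0, 100) (1000, 1100)

# Every id from min_id to max_id falls into exactly one batch.
covered = [i for lo, hi in ranges for i in range(lo, hi)]
assert len(covered) == len(set(covered))    # no batch overlaps another
assert set(range(0, 1051)) <= set(covered)  # no id is skipped
```

This is why the last batch may reach past `max_id` (here, up to 1099): the extra range simply matches no rows.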

Testing

Sample data: 1051 rows with sequential ids:

CREATE TABLE account (id int, name text);
INSERT INTO account VALUES(generate_series(0,1050),'untouched record..');

Executing anonymous code block ...

DO $$ 
DECLARE 
  page int := 100;
  min_id bigint; max_id bigint;
BEGIN
  SELECT max(id),min(id) INTO max_id,min_id FROM account;
  FOR j IN min_id..max_id BY page LOOP 
    UPDATE account SET name = now() ||' -> UPDATED ' || j  || ' to ' || j+page
    WHERE id >= j AND id < j+page;
    RAISE INFO 'committing data from % to % at %', j,j+page,now();
    COMMIT;            
  END LOOP;
END; $$;
    
INFO:  committing data from 0 to 100 at 2021-04-14 17:35:42.059025+02
INFO:  committing data from 100 to 200 at 2021-04-14 17:35:42.070274+02
INFO:  committing data from 200 to 300 at 2021-04-14 17:35:42.07806+02
INFO:  committing data from 300 to 400 at 2021-04-14 17:35:42.087201+02
INFO:  committing data from 400 to 500 at 2021-04-14 17:35:42.096548+02
INFO:  committing data from 500 to 600 at 2021-04-14 17:35:42.105876+02
INFO:  committing data from 600 to 700 at 2021-04-14 17:35:42.114514+02
INFO:  committing data from 700 to 800 at 2021-04-14 17:35:42.121946+02
INFO:  committing data from 800 to 900 at 2021-04-14 17:35:42.12897+02
INFO:  committing data from 900 to 1000 at 2021-04-14 17:35:42.134388+02
INFO:  committing data from 1000 to 1100 at 2021-04-14 17:35:42.13951+02

.. you can update your rows in batches. To make my point clearer, the following query counts the number of records grouped by update time:

SELECT name, count(id)
FROM account
GROUP BY name ORDER BY name;

                         name                         | count 
------------------------------------------------------+-------
 2021-04-14 17:35:42.059025+02 -> UPDATED 0 to 100    |   100
 2021-04-14 17:35:42.070274+02 -> UPDATED 100 to 200  |   100
 2021-04-14 17:35:42.07806+02 -> UPDATED 200 to 300   |   100
 2021-04-14 17:35:42.087201+02 -> UPDATED 300 to 400  |   100
 2021-04-14 17:35:42.096548+02 -> UPDATED 400 to 500  |   100
 2021-04-14 17:35:42.105876+02 -> UPDATED 500 to 600  |   100
 2021-04-14 17:35:42.114514+02 -> UPDATED 600 to 700  |   100
 2021-04-14 17:35:42.121946+02 -> UPDATED 700 to 800  |   100
 2021-04-14 17:35:42.12897+02 -> UPDATED 800 to 900   |   100
 2021-04-14 17:35:42.134388+02 -> UPDATED 900 to 1000 |   100
 2021-04-14 17:35:42.13951+02 -> UPDATED 1000 to 1100 |    51

Demo: db<>fiddle

Jim Jones
  • I am running your query and getting an error. Error: cannot begin/end transactions in PL/pgSQL. Context: PL/pgSQL function inline_code_block line 9 at SQL statement. SQL state: 0A000 – codegeek Apr 14 '21 at 16:00
  • @codegeek that's odd. Which postgres version are you using? Are you perhaps running the code block together with other queries in the same transaction? That would raise an exception! See my db<>fiddle, I separated the `do` block from the other statements. It has to work :-D – Jim Jones Apr 14 '21 at 17:21
  • Yes, thanks. I saw that, but my version is 10.5. On the fiddle it works for versions above 10; I checked 11, 12, and 13. Any solution for a lower version? :) – codegeek Apr 14 '21 at 17:27
  • @codegeek that's unfortunate. In old versions like 10 it won't work. You can try to catch the exception, but that is an ugly workaround. For this kind of question, always mention which version you're using. If you're on Linux, you can create a bash script and easily reproduce this loop. Other than that, I don't see an elegant solution for you. – Jim Jones Apr 14 '21 at 22:14
  • @JimJones I love your answer, but I almost blew up the DB: your query had a typo. It should be `AND id < j+page` instead of `AND j < j+page`. – Kamil Dziedzic Sep 08 '21 at 15:05
  • One problem I've noticed is that the process eats all the memory: 114522 postgres 20 0 98G 33.4G 33.4G R 99.1 10.3 3:24.90. No idea why this happens... – Kamil Dziedzic Sep 09 '21 at 12:47
  • @KamilDziedzic is the column in the where clause properly indexed? – Jim Jones Sep 09 '21 at 14:31
  • @KamilDziedzic btw. thanks for the hint about the typo! – Jim Jones Sep 09 '21 at 14:53
  • Yes, primary key. Not sure what the issue was, and I won't bother with it now, as I'm done with the update. Your query was very useful. Thanks again! – Kamil Dziedzic Sep 11 '21 at 21:27

You could use a procedure (available as of version 11) and do it one by one, like this:

CREATE or replace PROCEDURE do_update()
LANGUAGE plpgsql
AS $$
BEGIN
    FOR i IN 1..55000000 -- 55 million, or whatever number you need
    LOOP 

        UPDATE account SET name = 'Some name' WHERE id = i;
        COMMIT;
        
        RAISE INFO 'id: %', i;
    END LOOP;
END;
$$;

CALL do_update();
Frank Heikens

Setting up the testing environment:

DROP TABLE IF EXISTS account;
CREATE TABLE account(id integer, name text);

INSERT INTO account
VALUES (1, 'jonas'),(10002, 'petras');

Update script:

DO $$
DECLARE
  _id integer;
  _min_id integer;
  _max_id integer; 
  _batch_size integer = 10000;
BEGIN
  SELECT 
    MIN(id),
    MAX(id)
  INTO
    _min_id,
    _max_id
  FROM
    account;

  _id := _min_id;

  LOOP
    UPDATE account SET
      name = 'Some name' 
    WHERE id >=_id 
      AND id < _id + _batch_size;

    COMMIT;

    _id := _id + _batch_size;
    IF _id > _max_id THEN
      EXIT;
    END IF;
  END LOOP;
END;
$$;
Julius Tuskenis

But we don't want to lock our table.

Makes sense in many situations. But you did not disclose your actual setup. Do you even need a lock? Is there concurrent write activity? If not, is there enough storage to write another copy of the table? Then it might be preferable to build a new, pristine, updated table in the background, then switch and delete the old one.

Assuming concurrent write activity on the table: you don't want to block too much of it for too long, and you want to reuse dead tuples to keep table bloat and index bloat at bay. So updating in batches makes sense. You have to COMMIT (and VACUUM) between batches so that the space occupied by dead tuples can be reused, and spread out writes across the table so that consecutive transactions can produce and consume dead tuples in the same blocks.

Transaction control statements (like COMMIT) are allowed in procedures and in anonymous code blocks (DO statements) in Postgres 11 or newer. Other answers provided solutions using that.

autovacuum should be running with aggressive settings to release dead tuples for reuse in time. Or run VACUUM manually at intervals; but VACUUM cannot (currently) be run in a transaction context at all (only as an individual command), so it is not possible inside a PL/pgSQL loop.

Postgres 10 or older

No transaction control is allowed in code blocks yet, but we can emulate autonomous transactions with dblink.

It could look like this:

DO
$do$
DECLARE
   _cur  int := 0;  -- just start with 0 unless min is far off
   _step int := 10000;  -- batch size
   _max  CONSTANT int := (SELECT max(id) FROM account);  -- max id
   _val  CONSTANT text := 'SOME name';
BEGIN
   -- as superuser, or you must also provide the password for the current role;
   PERFORM dblink_connect('dbname=' || current_database());  -- current db

   LOOP
      RAISE NOTICE '%', _cur;
      PERFORM dblink_exec(  -- committed implicitly!
         format(
         $$
         UPDATE account
         SET    name = %2$L
         WHERE  id BETWEEN %1$s AND %1$s + %3$s  -- gaps don't matter unless huge
         AND    name IS DISTINCT FROM %2$L       -- avoid empty updates
         $$, _cur, _val, _step));

      
      _cur := _cur + _step;
      EXIT WHEN _cur > _max;          -- stop when done
   END LOOP;

   PERFORM dblink_disconnect();
END
$do$;

I also added another predicate:

     AND    name IS DISTINCT FROM 'SOME name'  -- avoid empty updates

This skips the cost of empty updates where the row already has the new name. Only useful if that can happen.

You may want to split it up further, and run VACUUM in between. And you may want to use some other column for selection than id (one that is not clustered) to get a good spread across the whole table.
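If dblink is not an option on Postgres 10, the same batching can also be driven from any client, where each statement runs in its own autocommitted transaction. Below is a minimal Python sketch under assumptions: the table `account` and integer `id` column come from the question, the psycopg2 connection shown in comments is hypothetical, and inlining values into the SQL string is only acceptable here because the new name is a fixed, trusted literal (use parameterized queries otherwise):

```python
# Build one UPDATE per batch; running each statement on an autocommitting
# connection releases locks and dead tuples between batches, like COMMIT
# inside the server-side loop.
def batch_updates(min_id, max_id, step, new_name):
    stmts = []
    cur = min_id
    while cur <= max_id:
        # half-open range + "IS DISTINCT FROM" to skip empty updates
        stmts.append(
            "UPDATE account SET name = '{0}' "
            "WHERE id >= {1} AND id < {2} "
            "AND name IS DISTINCT FROM '{0}';".format(new_name, cur, cur + step)
        )
        cur += step
    return stmts

stmts = batch_updates(0, 1050, 100, 'SOME name')
print(len(stmts))  # 11 batches for ids 0..1050 with step 100
print(stmts[0])

# With a real connection (e.g. psycopg2) the loop would be, roughly:
# conn.autocommit = True
# with conn.cursor() as c:
#     for s in stmts:
#         c.execute(s)  # each UPDATE commits on its own
```

You could also run `VACUUM account` between some of the batches from the same client, which is not possible inside a PL/pgSQL loop.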

Erwin Brandstetter