62

I am trying a simple UPDATE table SET column1 = 0 on a table with about 3 million rows on Postgres 8.4, but it is taking forever to finish. It has been running for more than 10 min.

Before that, I tried running VACUUM and ANALYZE on the table, and I also tried creating some indexes (although I doubt this will make any difference in this case), but none of it seems to help.

Any other ideas?

Update:

This is the table structure:

CREATE TABLE myTable
(
  id bigserial NOT NULL,
  title text,
  description text,
  link text,
  "type" character varying(255),
  generalFreq real,
  generalWeight real,
  author_id bigint,
  status_id bigint,
  CONSTRAINT resources_pkey PRIMARY KEY (id),
  CONSTRAINT author_pkey FOREIGN KEY (author_id)
      REFERENCES users (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT c_unique_status_id UNIQUE (status_id)
);

I am trying to run UPDATE myTable SET generalFreq = 0;

Eric Leschinski
Ricardo
  • It would help to see the full table definition and the full query you are running. Also, the output of "explain analyze " would be very helpful. – kasperjj Jul 29 '10 at 10:07
  • Added the table structure. The command explain analyze UPDATE myTable SET generalFreq = 0; also takes a very long time to complete. – Ricardo Jul 29 '10 at 10:33
  • do you by any chance have an index on generalFreq? – kasperjj Jul 29 '10 at 11:01
  • Oh.. and sorry, my mistake... you should run just explain, not explain analyze. That should return almost instantly. – kasperjj Jul 29 '10 at 11:04
  • ok, the explain returns the following: "Seq Scan on myTable (cost=0.00..181915.37 rows=5156537 width=1287)" What does it mean? – Ricardo Jul 29 '10 at 11:20
  • And no, I don't have an index on generalFreq. – Ricardo Jul 29 '10 at 11:20
  • Still important with Postgres 14... No CPU, no disk, memory is plentiful, threads are plentiful, and the waiting is endless. One practical approach with a single-value column is to drop the column and add it back with the desired value as the default. In some circumstances (usually a set of dependent views) this is hard to do, at least administratively, since you may have to drop and recreate the views. Sometimes you can create an empty table like the one that needs updating, change the dependencies to point at this placeholder, drop and recreate the column, and finally connect the view(s) again. – Jan Jan 11 '23 at 19:44

10 Answers

74

I have to update tables of 1 or 2 billion rows with various values for each row. Each run makes ~100 million changes (10%). My first try was to group them into transactions of 300K updates directly on a specific partition, as PostgreSQL does not always optimize prepared queries if you use partitions.

  1. Transactions of bunches of "UPDATE myTable SET myField=value WHERE myId=id"
    Gives 1,500 updates/sec., which means each run would take at least 18 hours.
  2. HOT updates solution, as described here, with FILLFACTOR=50. Gives 1,600 updates/sec. I use SSDs, so it's a costly improvement as it doubles the storage size.
  3. Insert the updated values into a temporary table and merge them afterwards with UPDATE...FROM. Gives 18,000 updates/sec. if I do a VACUUM for each partition; 100,000 up/s otherwise. Cooool.
    Here is the sequence of operations:

CREATE TEMP TABLE tempTable (id BIGINT NOT NULL, field(s) to be updated,
CONSTRAINT tempTable_pkey PRIMARY KEY (id));

Accumulate a bunch of updates in a buffer, depending on available RAM. When it's full, or you need to change table/partition, or it's complete:

COPY tempTable FROM buffer;
UPDATE myTable a SET field(s)=value(s) FROM tempTable b WHERE a.id=b.id;
COMMIT;
TRUNCATE TABLE tempTable;
VACUUM FULL ANALYZE myTable;

That means a run now takes 1.5 h instead of 18 h for 100 million updates, vacuum included. To save time, it's not necessary to do a VACUUM FULL at the end, but even a fast regular vacuum is useful to keep the database's transaction ID usage under control and to avoid an unwanted autovacuum during rush hours.
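
For instance, applied to the table from the question with generalFreq as the only updated field, the sequence might look roughly like this (a sketch; 'updates.csv' is a hypothetical client-side buffer file of id;generalfreq pairs, loaded with psql's \copy):

CREATE TEMP TABLE tempTable (id BIGINT NOT NULL, generalFreq REAL,
CONSTRAINT tempTable_pkey PRIMARY KEY (id));

-- load the buffered rows from the client
\copy tempTable FROM 'updates.csv' WITH (FORMAT csv, DELIMITER ';')

UPDATE myTable a SET generalFreq = b.generalFreq FROM tempTable b WHERE a.id = b.id;
COMMIT;
TRUNCATE TABLE tempTable;
VACUUM ANALYZE myTable;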

Andrei Volgin
Le Droid
  • 10
    I faced a similar problem today and by doing the change you describe (accumulate changes in temp table, then update from it), we saw a 100x increase in performance. 2.5 hours to about 2 minutes. –  Oct 25 '15 at 01:41
  • 8
    Sorry - I am not quite getting the buffer part. Could you add more info on how to "*Accumulate a bunch of updates in a buffer*"? – n1000 Apr 06 '16 at 07:04
  • 4
    @n1000 The buffer could be a CSV file or a [StringIO object](https://docs.python.org/3.5/library/io.html#io.StringIO) in Python. Each line of the buffer will contain the same data structure as your "tempTable" so that the PostgreSQL [COPY command](http://www.postgresql.org/docs/current/static/sql-copy.html) works. – Le Droid Apr 12 '16 at 21:37
  • 1
    Thank you man. I tried this solution; it was supposed to take nearly 30 hours and now it only needs 77.1 s for the update command. Cheers! – whyisyoung Jan 06 '17 at 01:50
  • 1
    Did you verify that it was using HOT? You have to be careful because the modified column can't be a part of any index FWIW... :) See also https://blog.codacy.com/how-to-update-large-tables-in-postgresql-e9aecd197fb7 – rogerdpack Nov 23 '17 at 20:36
  • @rogerdpack Thanks, good link. I don't see why updated column(s) can't be part of an index? – Le Droid Nov 24 '17 at 16:49
  • Ok. They weren't in my test, but HOT is almost useless on SSDs, as random access is just as fast wherever the sector is on the drive (only a 10% improvement due to cache hits). – Le Droid Nov 24 '17 at 19:51
  • **How many rows** does your **tempTable** contain in **each iteration**? Is 100,000 a good number (assuming I have 64 GB of ram)? Does the commit statement after the update statement bring any speed advantages, or could I also just do a commit at the very end, when I have finished with all the update chunks? – zingi Dec 09 '19 at 08:03
  • 1
    @zingi Yes, 100,000 is a good number. Test it with 50K and 200K to see if it makes a difference, but it's of that order of magnitude. In my environment 300K was a good choice. It's better to make regular commits; depending on the size of your transaction, it could fire autovacuum on other tables so as not to bypass the wrap-around protection, and the maximum transaction size is 2 billion rows. I added a comment about the vacuum that could reduce overall time. – Le Droid Dec 09 '19 at 22:53
  • @LeDroid If I already have the tempTable in my database, is it still a good idea to update them in chunk? – cccfran Jul 21 '20 at 20:01
  • @cccfran I don't think it's useful, except if you want to display a progress meter to the user or don't want to lock the table for too long. I didn't check the effect of the update on memory; the chunks were necessary for the insertion into the temp table. Try it in one batch and tell us if you have a problem with it. – Le Droid Jul 23 '20 at 19:34
  • 1
    Why exactly is it faster to merge using UPDATE x FROM tempTable in chunks than doing it all at once? Concerning the MVCC model, it would still redo the row, right? – Joe Sep 07 '21 at 13:17
  • 1
    @Joe It's not faster; it's to manage resources (memory, transaction management, ...) when it involves a big number of rows. – Le Droid Sep 17 '21 at 02:31
  • 1
    @LeDroid Thanks for the answer. So the time/effort does not scale linearly with the amount of rows? In that case, it makes sense why batching would be more efficient. May I suggest adding that to your answer? It's IMHO a very interesting and important part of the "why" that's missing. – Joe Sep 17 '21 at 08:56
  • @Joe : Good point, but it's a generally good practice to use; for inserts too it makes a big difference in resources & performance. Especially if you limit temporary tables to data corresponding to the one specific partition you have to JOIN with. – Le Droid Sep 27 '21 at 20:23
  • What were the PostgreSQL server resources when this was tested? CPU, RAM, SSD (speed, IOPS)? And were any Postgres configuration settings changed? – Sulaymon Hursanov Jan 11 '23 at 18:22
  • @SulaymonHursanov SSDs in RAID 0 (2 GB/sec seq. read), 192 GB RAM, a single process on a 20-core PC; of course, all the PostgreSQL config was revised for our needs & hardware. – Le Droid Jan 13 '23 at 13:29
  • This answer would be way more useful if it had all of the example in SQL – UnknownBeef Aug 29 '23 at 16:55
21

Take a look at this answer: PostgreSQL slow on a large table with arrays and lots of updates

First start with a better FILLFACTOR, do a VACUUM FULL to force a table rewrite, and check the HOT updates after your UPDATE query:

SELECT n_tup_hot_upd, * FROM pg_stat_user_tables WHERE relname = 'myTable';

HOT updates are much faster when you have a lot of records to update. More information about HOT can be found in this article.
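
For reference, changing the fillfactor and rewriting the table might look like this (a sketch; 50 is just an illustrative value, pick one based on how much free space per page your updates need):

ALTER TABLE myTable SET (fillfactor = 50);  -- leave ~50% of each page free for HOT updates
VACUUM FULL myTable;                        -- rewrite the table so the new fillfactor takes effect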

Ps. You need version 8.3 or better.

rogerdpack
Frank Heikens
13

After waiting 35 min. for my UPDATE query to finish (and it still hadn't), I decided to try something different. So what I ran was this command:

CREATE TABLE table2 AS 
SELECT 
  all the fields of table1 except the one I wanted to update, 0 as theFieldToUpdate
from myTable

Then I added the indexes, dropped the old table, and renamed the new one to take its place. That took only 1.7 min. to process, plus some extra time to recreate the indexes and constraints. But it did help! :)

Of course, that worked only because nobody else was using the database. I would need to lock the table first if this were a production environment.
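
With the table from the question, the rewrite could look roughly like this (a sketch; the primary key, foreign key, unique constraint, indexes, and the sequence behind the bigserial id all have to be recreated or re-attached afterwards):

CREATE TABLE myTable_new AS
SELECT id, title, description, link, "type", 0::real AS generalFreq,
       generalWeight, author_id, status_id
FROM myTable;

-- recreate indexes and constraints on myTable_new here, then swap the tables
DROP TABLE myTable;
ALTER TABLE myTable_new RENAME TO myTable;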

rogerdpack
Ricardo
  • 8
    PostgreSQL's MVCC implementation makes updates expensive. If you're updating every row in the table, each row needs to be copied as a new version, and the old version marked as deleted. So it's not surprising that rewriting the table is faster (which is what altering the type of a column does automatically, for instance). Not much you can do about it, just a performance characteristic to be aware of. – araqnid Jul 29 '10 at 17:40
  • Thanks for the explanation, araqnid. I didn't know postgresql did implement updates like that. – Ricardo Jul 29 '10 at 18:43
6

Today I spent many hours on a similar issue. I found a solution: drop all the constraints/indexes before the update. No matter whether the column being updated is indexed or not, it seems like Postgres updates all the indexes for every updated row. After the update is finished, add the constraints/indexes back.
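
A rough sketch of that workflow on the question's table (partial; the index name below is hypothetical, so use the actual names shown by \d myTable, and leave the primary key in place if other tables reference it):

ALTER TABLE myTable DROP CONSTRAINT c_unique_status_id;
DROP INDEX IF EXISTS mytable_author_id_idx;  -- hypothetical secondary index

UPDATE myTable SET generalFreq = 0;

ALTER TABLE myTable ADD CONSTRAINT c_unique_status_id UNIQUE (status_id);
CREATE INDEX mytable_author_id_idx ON myTable (author_id);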

rogerdpack
Tregoreg
  • 1
    If your data is "sparse" (lots of churn/free space within blocks) and the column being updated is not part of an index, then Postgres can use HOT and speed up by not needing to update indices. So this might help that way. See also https://dba.stackexchange.com/questions/15720/why-did-postgres-update-take-39-hours/84653#84653 ... Maybe it helps because an update in Postgres is equivalent to DELETE + INSERT and so it has to update all indexes for that row? Or maybe Postgres does bizarre things like rewriting full blocks worth of the index ("your personal copy of the index") until you commit :\ – rogerdpack Nov 24 '17 at 17:23
2

Try this (note that generalFreq starts as type REAL, and stays the same):

ALTER TABLE myTable ALTER COLUMN generalFreq TYPE REAL USING 0;

This will rewrite the table, similar to a DROP + CREATE, and rebuild all indices, all in one command. It's much faster (about 2x), and you don't have to deal with dependencies and recreating indexes and other stuff, though it does lock the table (ACCESS EXCLUSIVE, i.e. a full lock) for the duration. Or maybe that's what you want, if you want everything else to queue up behind it. If you aren't updating "too many" rows, this way is slower than just an update.

rogerdpack
  • I think this does do the index drop and replace trick, because for me it went from like 3 weeks to 1 night... – rogerdpack Oct 06 '20 at 21:47
1

How are you running it? If you are looping over each row and performing an update statement for each, you are potentially running millions of individual updates, which is why it performs so slowly.

If you update all records in a single statement, it will run a lot faster; if that is still slow, it's probably down to your hardware more than anything else. 3 million is a lot of records.

Tom Gullen
  • 1
    Hi Tom, thanks. I am running a single update from the psql command line. I understand 3 million is a lot but in my experience with other databases, it shouldn't take more than 10 min. to run a single update in one numeric column. – Ricardo Jul 29 '10 at 11:00
  • I wouldn't have expected it to take so long either, especially with a constant assignment (setting all fields to 0); memory-wise this should be pretty fast for a DB to handle. I've only limited experience with Postgres, but you could try doing it in batches of 100k and timing it to see how long you can expect the 3 million to take; it might just be the case that Postgres isn't very good at this unusual operation. – Tom Gullen Jul 29 '10 at 11:06
1

The first thing I'd suggest (from https://dba.stackexchange.com/questions/118178/does-updating-a-row-with-the-same-value-actually-update-the-row) is to only update rows that "need" it, ex:

 UPDATE myTable SET generalFreq = 0 where generalFreq != 0;

(You might also need an index on generalFreq.) Then you'll update fewer rows. This won't help if the values are all non-zero already, but updating fewer rows "can help", since otherwise Postgres rewrites them and updates all the indexes regardless of whether the value changed or not.
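
If only a minority of rows are non-zero, a partial index can make finding them cheap (a sketch; the index name is made up, and note that indexing generalFreq disables HOT updates for that column):

CREATE INDEX mytable_generalfreq_nonzero_idx ON myTable (generalFreq) WHERE generalFreq != 0;
UPDATE myTable SET generalFreq = 0 WHERE generalFreq != 0;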

Another option: if the stars align in terms of defaults and not-null constraints, you can drop the old column and create a new one by just adjusting metadata, which is nearly instant.
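
A sketch of that route on the question's column (on PostgreSQL 11+ adding a column with a constant default is metadata-only; on older versions, including the 8.4 in the question, the ADD COLUMN ... DEFAULT itself rewrites the table; dependent views and column order are also affected):

ALTER TABLE myTable DROP COLUMN generalFreq;
ALTER TABLE myTable ADD COLUMN generalFreq real DEFAULT 0;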

rogerdpack
1

In my tests I noticed that a big update, more than 200 000 rows, is slower than 2 updates of 100 000 rows, even with a temporary table.

My solution is to loop: in each iteration I create a temporary table of 200,000 rows, compute my values in that table, then update my main table with the new values, and so on...

Every 2,000,000 rows I manually run "VACUUM ANALYZE mytable"; I noticed that autovacuum doesn't do its job for such updates.

Rolintocour
1

I need to update more than 1B rows in a PostgreSQL table that has some indexes. I am working with PostgreSQL 12 + SQLAlchemy + Python.

Inspired by the answers here, I wrote a temp-table and UPDATE ... FROM based updater to see if it makes a difference. The temp table is fed from a CSV generated by Python and uploaded over the normal SQL client connection.

The speed-up over the naive approach using SQLAlchemy's bulk_update_mappings is 4x - 5x. Not an order of magnitude, but still considerable, and in my case this means 1 day, not 1 week, for a batch job.

Below is the relevant Python code that does CREATE TEMPORARY TABLE, COPY FROM and UPDATE FROM. See the full example in this gist.

import csv
import logging
from io import StringIO
from typing import List

from sqlalchemy.orm import Session

logger = logging.getLogger(__name__)


def bulk_load_psql_using_temp_table(
        dbsession: Session,
        data_as_dicts: List[dict],
):
    """Bulk update columns in PostgreSQL faster using temp table.

    Works around speed issues on `bulk_update_mapping()` and PostgreSQL.
    Your mileage and speed may vary, but it is going to be faster.
    The observation was 3x ... 4x faster when doing UPDATEs
    where one of the columns is indexed.

    Contains hardcoded temp table creation and UPDATE FROM statements.
    In our case we are bulk updating three columns.

    - Create a temp table - if not created before

    - Filling it from the in-memory CSV using COPY FROM

    - Then performing UPDATE ... FROM on the actual table from the temp table

    - Between the update chunks, clear the temp table using TRUNCATE

    Why is it faster? I did not get a clear answer from the sources I was reading.
    At least there should be
    less data uploaded from the client to the server,
    as CSV loading is more compact than bulk updates.

    Further reading

    - `About PSQL temp tables <https://www.postgresqltutorial.com/postgresql-tutorial/postgresql-temporary-table/>`_

    - `Naive bulk_update_mapping approach <https://stackoverflow.com/questions/36272316/using-bulk-update-mappings-in-sqlalchemy-to-update-multiple-rows-with-different>`_

    - `Discussion on UPDATE ... FROM + temp table approach <https://stackoverflow.com/questions/3361291/slow-simple-update-query-on-postgresql-database-with-3-million-rows/24811058#24811058>_`.

    :dbsession:
        SQLAlchemy session.
        Note that we open a separate connection for the bulk update.

    :param data_as_dicts:
        In bound data as it would be given to bulk_update_mapping
    """

    # mem table created in sql
    temp_table_name = "temp_bulk_temp_loader"

    # the real table of which data we are filling
    real_table_name = "swap"

    # columns we need to copy
    columns = ["id", "sync_event_id", "sync_reserve0", "sync_reserve1"]

    # how our CSV fields are separated
    delim = ";"

    # Increase temp buffer size for updates
    temp_buffer_size = "3000MB"

    # Dump data to a local mem buffer using CSV writer.
    # No header - this is specifically addressed in copy_from()
    out = StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, delimiter=delim)
    writer.writerows(data_as_dicts)

    # Update data in alternative raw connection
    engine = dbsession.bind
    conn = engine.connect()

    try:
        # No rollbacks
        conn.execution_options(isolation_level="AUTOCOMMIT")

        # See https://blog.codacy.com/how-to-update-large-tables-in-postgresql/
        conn.execute(f"""SET temp_buffers = "{temp_buffer_size}";""")

        # Temp table is dropped at the end of the session
        # https://www.postgresqltutorial.com/postgresql-tutorial/postgresql-temporary-table/
        # This must match data_as_dicts structure.
        sql = f"""
        CREATE TEMP TABLE IF NOT EXISTS {temp_table_name}
        (
          id int,
          sync_event_id int,
          sync_reserve0 bytea,
          sync_reserve1 bytea
        );    
        """
        conn.execute(sql)

        # Clean any pending data in the temp table
        # between update chunks.
        # TODO: Not sure why this does not clear itself at conn.close()
        # as I would expect based on the documentation.
        sql = f"TRUNCATE {temp_table_name}"
        conn.execute(sql)

        # Load data from CSV to the temp table
        # https://www.psycopg.org/docs/cursor.html
        cursor = conn.connection.cursor()
        out.seek(0)
        cursor.copy_from(out, temp_table_name, sep=delim, columns=columns)

        # Fill real table from the temp table
        # This copies values from the temp table using
        # UPDATE...FROM and matching by the row id.
        sql = f"""
        UPDATE {real_table_name}  
        SET 
            sync_event_id=b.sync_event_id,
            sync_reserve0=b.sync_reserve0,
            sync_reserve1=b.sync_reserve1        
        FROM {temp_table_name} AS b 
        WHERE {real_table_name}.id=b.id;
        """
        res = conn.execute(sql)
        logger.debug("Updated %d rows", res.rowcount)
    finally:
        conn.close()

Mikko Ohtamaa
-6

Try:

UPDATE myTable SET generalFreq = 0.0;

Maybe it is a casting issue.

Mudassir Hasan
Chocolim