
I am attempting to make an UPDATE query run faster in Postgres. The query is relatively simple, and I have broken it up into chunks to spread it across all of the CPUs on my database server.

UPDATE p797.line a SET p = 5 FROM p797.pt b WHERE a.source = b.node AND a.id >= 0 AND a.id < 40000000

where "0" and "40000000" are replaced with different values as you move through all the rows in the table. The "line" table has 1.3 billion records and the pt table has 500 million.

Right now this process runs in about 16 hours. I have other update queries that I need to perform and if each takes 16 hours, the results will take weeks to acquire.

I found something interesting that I would like to try, but am unsure whether it can be implemented in my case, as I am running the queries over a network.

Slow simple update query on PostgreSQL database with 3 million rows

Here, Le Droid makes reference to COPY, a method which I believe I cannot employ because I am running over a network. They also mention BUFFER, which I do not understand how to use. Also, both of my tables reside in the same database, rather than one being a database table and the other a CSV file. How can I massage my query to get the gains that @Le Droid mentions? Is there another methodology I can employ to see time gains? I did see Le Droid mention that HOT only sees marginal gains at a lot of cost. Other methods?

It might also be noteworthy that I am creating the queries in Python and sending them to the Postgres database using psycopg2.
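
For context, here is a simplified sketch of the kind of dispatch loop I am using (the connection string, chunk size, and worker count below are placeholders rather than my actual values):

    import psycopg2
    from multiprocessing import Pool

    DSN = "dbname=mydb host=dbserver user=me"  # placeholder connection string
    CHUNK = 40_000_000                         # id range handled by each worker
    MAX_ID = 1_300_000_000                     # approximate upper bound on line.id

    def run_chunk(bounds):
        lo, hi = bounds
        conn = psycopg2.connect(DSN)
        with conn, conn.cursor() as cur:       # commits on success, rolls back on error
            cur.execute(
                "UPDATE p797.line a SET p = 5 "
                "FROM p797.pt b "
                "WHERE a.source = b.node AND a.id >= %s AND a.id < %s",
                (lo, hi),
            )
        conn.close()

    if __name__ == "__main__":
        ranges = [(lo, lo + CHUNK) for lo in range(0, MAX_ID, CHUNK)]
        with Pool(processes=12) as pool:       # one worker per core
            pool.map(run_chunk, ranges)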

EDIT:

Here is an EXPLAIN on the above statement without the ID limitation:

"Update on line a  (cost=10665536.12..342338721.96 rows=1381265438 width=116)"
"  ->  Hash Join  (cost=10665536.12..342338721.96 rows=1381265438 width=116)"
"        Hash Cond: (a.source= b.node)"
"        ->  Seq Scan on line a  (cost=0.00..52953645.38 rows=1381265438 width=102)"
"        ->  Hash  (cost=8347277.72..8347277.72 rows=126271072 width=22)"
"              ->  Seq Scan on pt b  (cost=0.00..8347277.72 rows=126271072 width=22)"
D_C
  • Is 16 hours for just one chunk, or for all chunks? Please show the EXPLAIN (ANALYZE, BUFFERS) for the largest query that completes in reasonable time, and the EXPLAIN for the query which would update all rows at once. What is your work_mem set to? How much RAM do you have available? – jjanes Nov 15 '19 at 16:57
  • I have 12 cores, 64 GB of RAM. work_mem = "100MB". I am trying to figure out how to run the EXPLAIN (ANALYZE, BUFFERS) SQL – D_C Nov 15 '19 at 18:22
  • `EXPLAIN(ANALYZE, BUFFERS) UPDATE p797.line a SET p = 5 FROM p797.pt b WHERE a.source = b.node AND a.id >= 0 and a.id < 40000000`. Note that this will actually carry out the update. If you run without ANALYZE, it will not carry out the update, just plan out how to do it. – jjanes Nov 15 '19 at 18:50

2 Answers


Frankly I'd extract the data, apply all transformations outside the database, then reload it. So an ETL, but with the E and the L being the same table. The transactional guarantees the database provides do not come cheap, and if I didn't need them I wouldn't want to pay that price in this situation.
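
A minimal sketch of what such an extract-and-reload could look like with psycopg2, which can drive COPY over a network connection through copy_expert. The file name, staging table, and column list below are illustrative assumptions; a real run would carry every column of the table and would also assume pt.node is unique:

    import psycopg2

    DSN = "dbname=mydb host=dbserver user=me"  # placeholder connection string

    conn = psycopg2.connect(DSN)
    cur = conn.cursor()

    # Extract: stream the transformed rows out of the database. The transformation
    # (p = 5 for rows whose source matches a pt.node) is applied in the SELECT here,
    # but it could just as well be applied to the dump file outside the database.
    with open("line_transformed.csv", "w") as f:
        cur.copy_expert(
            """
            COPY (
                SELECT a.id, a.source,
                       CASE WHEN b.node IS NOT NULL THEN 5 ELSE a.p END AS p
                FROM   p797.line a
                LEFT   JOIN p797.pt b ON a.source = b.node
            ) TO STDOUT WITH (FORMAT csv)
            """,
            f,
        )

    # Load: bulk-load into a fresh, index-free staging table (only the relevant
    # columns are shown), then rebuild indexes and swap it in after verification.
    cur.execute(
        "CREATE TABLE p797.line_new AS "
        "SELECT id, source, p FROM p797.line WITH NO DATA"
    )
    with open("line_transformed.csv") as f:
        cur.copy_expert(
            "COPY p797.line_new (id, source, p) FROM STDIN WITH (FORMAT csv)", f
        )
    conn.commit()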

jjanes

Dropping all indexes from the table you are updating gives a tremendous performance improvement for updates, on the order of 100x faster, even if the indexes are not related to the columns you are updating or joining on.
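
A hedged sketch of that approach with psycopg2; the connection string and the single example chunk are placeholders, and indexes that back constraints (such as the primary key) cannot be dropped this way and would need ALTER TABLE instead:

    import psycopg2

    DSN = "dbname=mydb host=dbserver user=me"  # placeholder connection string

    conn = psycopg2.connect(DSN)
    conn.autocommit = True
    cur = conn.cursor()

    # Capture the existing index definitions so they can be recreated afterwards.
    cur.execute(
        "SELECT indexname, indexdef FROM pg_indexes "
        "WHERE schemaname = 'p797' AND tablename = 'line'"
    )
    indexes = cur.fetchall()

    # Drop the indexes before the bulk update.
    for name, _ in indexes:
        cur.execute('DROP INDEX IF EXISTS p797."{}"'.format(name))

    # Run the chunked updates (one example chunk shown).
    cur.execute(
        "UPDATE p797.line a SET p = 5 FROM p797.pt b "
        "WHERE a.source = b.node AND a.id >= %s AND a.id < %s",
        (0, 40_000_000),
    )

    # Recreate the indexes from the saved definitions.
    for _, definition in indexes:
        cur.execute(definition)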