
On PostgreSQL 9.0 we have this SQL query:

SELECT count(*) FROM lane 
WHERE not exists 
    (SELECT 1 FROM id_map 
    WHERE id_map.new_id=lane.lane_id  
    and id_map.column_name='lane_id' 
    and id_map.table_name='lane') 
and lane.lane_id is not null;

It usually takes around 1.5 seconds to finish. Here's the explain plan: http://explain.depesz.com/s/axNN

Sometimes, though, this query hangs and never finishes. It may run for 11 hours without success, taking up 100% of the CPU the whole time.

The only locks this query takes are AccessShareLocks, and they are all granted. I checked with:

SELECT a.datname,
         c.relname,
         l.transactionid,
         l.mode,
         l.granted,
         a.usename,
         a.current_query, 
         a.query_start,
         age(now(), a.query_start) AS "age", 
         a.procpid 
    FROM  pg_stat_activity a
     JOIN pg_locks         l ON l.pid = a.procpid
     JOIN pg_class         c ON c.oid = l.relation
    ORDER BY a.query_start;

The query is run as part of a Java process that connects to the database through a connection pool and sequentially performs similar SELECT queries of this format:

SELECT count(*) FROM {}  WHERE not exists (SELECT 1 FROM id_map WHERE id_map.new_id={}.{} and id_map.column_name='{}' and id_map.table_name='{}') and {}.{} is not null

No updates or deletes are happening in parallel with this process, so I don't think vacuuming can be the issue here. Prior to running the entire process (so before the 6 queries of this sort are run), an ANALYZE was run on all the tables.

The Postgres logs show no entries for the long-running queries, because they never finish and therefore never get logged.

Any idea what may cause this kind of behavior and how to prevent it from happening?
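One defensive measure worth noting (it bounds the damage rather than fixing the root cause) is capping statement runtime on the sessions the process uses, so a runaway plan gets cancelled and logged instead of spinning for hours. A sketch, with an arbitrary 5-minute cap:

```sql
-- Cancel any statement on this session that runs longer than 5 minutes;
-- the cancelled query then shows up in the Postgres logs with an error.
SET statement_timeout = '5min';
```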

The explain plan without ANALYZE:

Aggregate  (cost=874337.91..874337.92 rows=1 width=0)
  ->  Nested Loop Anti Join  (cost=0.00..870424.70 rows=1565283 width=0)
        Join Filter: (id_map.new_id = lane.lane_id)
        ->  Seq Scan on lane  (cost=0.00..30281.84 rows=1565284 width=8)
              Filter: (lane_id IS NOT NULL)
        ->  Materialize  (cost=0.00..816663.60 rows=1 width=8)
              ->  Seq Scan on id_map  (cost=0.00..816663.60 rows=1 width=8)
                    Filter: (((column_name)::text = 'lane_id'::text) AND ((table_name)::text = 'lane'::text))
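In the plan above, the planner estimates only one matching id_map row, so it rescans id_map (via the Materialize node) once per lane row. A hedged fix that makes the inner lookup cheap regardless of the estimate, assuming no such index exists yet (the index name is made up):

```sql
-- Hypothetical composite index matching the anti-join predicate;
-- turns the repeated seq scan of id_map into an index lookup.
CREATE INDEX idx_id_map_table_column_newid
    ON id_map (table_name, column_name, new_id);
```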
Vao Tsun
norbitheeviljester
  • I had the same issue with Java. It was not the Postgres version - it was Java not terminating the connection... Is the 100% CPU from Postgres or Java? What is in pg_stat_activity? – Vao Tsun May 20 '15 at 10:53
  • 100% by Postgres. pg_stat_activity shows a never-ending SQL query – norbitheeviljester May 20 '15 at 10:59
  • When it goes to 100% CPU, can you check the plan? It could be nested loops instead of hash joins, which would mean you give too little RAM for caching... Another angle is how fresh the statistics are. Both are worth studying if the execution plan differs when it hits 100% CPU – Vao Tsun May 20 '15 at 11:10
  • I'm not familiar with any way of checking the execution plan of an already running SQL query. I've attached the execution plan for the query in general in my post. Do you know of any way of accessing the real plan of a running query? – norbitheeviljester May 20 '15 at 11:12
  • You can't, unless there is some fancy extension for that... What I asked for was EXPLAIN SELECT ... (skipping ANALYZE), to see the expected plan – Vao Tsun May 20 '15 at 11:21
  • The explain plan is here: http://explain.depesz.com/s/axNN; the explain plan without ANALYZE was added to the question – norbitheeviljester May 20 '15 at 11:27
  • Aha - so it did switch to nested loops. Now try setting the caches to higher values and, more essentially, run vacuum analyze verbose; – Vao Tsun May 20 '15 at 11:31
  • Unfortunately I am unable to change the configuration of our development machine without issuing a ticket for it. As for the vacuum: this table only ever has inserts and selects performed on it. There are absolutely no dead tuples in it, so I don't see any reason for vacuuming? – norbitheeviljester May 20 '15 at 11:35
  • VACUUM ANALYZE will tell the size of indexes and tables - not only removed rows, but inserted ones as well. Look at your "good" plan, e.g. (cost=0.00..1,478,526.18 rows=79,084 width=8) (actual time=229.454..635.994 rows=1,565,284 loops=1) - the estimates differ from what you actually get when you run it with EXPLAIN ANALYZE... Gathering statistics is beneficial on all changes, while VACUUM FULL is needed more to reclaim space, followed by ANALYZE – Vao Tsun May 20 '15 at 11:48

2 Answers

VACUUM ANALYZE VERBOSE;

Refreshing the statistics should help the database choose an optimal plan instead of nested loops, which I believe are what consume 100% of the CPU.
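A minimal targeted form, assuming only the two tables from the question are involved:

```sql
-- Refresh statistics (and reclaim dead space, if any) for just these tables;
-- with accurate row estimates the planner can pick a hash anti join again.
VACUUM ANALYZE lane;
VACUUM ANALYZE id_map;
```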

Vao Tsun
  • I managed to verify that this was indeed the root cause of the problem. The nested loop performs 1.5 million sequential scans of the id_map table, which takes days to finish. When doing a VACUUM ANALYZE on both tables prior to running the query, the query finishes in under 2 seconds. – norbitheeviljester May 21 '15 at 11:07

This problem might be caused by one of the following (from what I understood):

  • Postgres has run out of available transaction IDs. When all of the two billion available transaction IDs have been used, they start over again at one, which results in wraparound issues that can cause severe data loss or a database shutdown.
  • The tables are bloated: DELETE and UPDATE commands (an UPDATE is handled by Postgres as an INSERT plus a DELETE) mark tuples as dead but do not physically remove them.

If you are on a managed cloud server such as GCloud, you can set database flags so that VACUUM is run automatically, cleaning up the dead tuples still sitting in your database, and ANALYZE gathers up-to-date statistics on frequently updated tables for the execution planner. Example:

autovacuum: on
autovacuum_analyze_scale_factor: 0.05
autovacuum_analyze_threshold: 10
autovacuum_naptime: 15
autovacuum_vacuum_cost_delay: 10
autovacuum_vacuum_cost_limit: 1000
autovacuum_vacuum_scale_factor: 0.1
autovacuum_vacuum_threshold: 25
log_autovacuum_min_duration: 0
track_counts: on
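On a self-managed server (rather than via cloud database flags), the same settings could be applied with ALTER SYSTEM, which is available from PostgreSQL 9.4 on. A sketch covering a few of the flags above:

```sql
-- Apply a subset of the autovacuum flags on a self-managed instance (9.4+);
-- these parameters take effect on a config reload, no restart needed.
ALTER SYSTEM SET autovacuum = on;
ALTER SYSTEM SET autovacuum_analyze_scale_factor = 0.05;
ALTER SYSTEM SET autovacuum_naptime = 15;
SELECT pg_reload_conf();  -- reload so the new values take effect
```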

Sources:

  • https://www.postgresql.org/docs/9.5/runtime-config-autovacuum.html
  • https://www.techonthenet.com/postgresql/autovacuum.php
  • https://aws.amazon.com/premiumsupport/knowledge-center/transaction-id-wraparound-effects/

GCloud database flags

Roni Castro