
When I run VACUUM VERBOSE on a table, the output shows an oldest xmin value of 9696975, as shown below:

table_xxx: found 0 removable, 41472710 nonremovable row versions in 482550 out of 482550 pages
DETAIL:  41331110 dead row versions cannot be removed yet, oldest xmin: 9696975
There were 0 unused item identifiers.

But when I check pg_stat_activity, there are no entries whose backend_xmin matches this oldest xmin value. Here is the query I run and the result I get:

SELECT backend_xmin
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;

Response:

backend_xmin
------------
10134695
10134696
10134696
10134696
10134696

The issue I am facing is that VACUUM is not removing any dead tuples from the table. I tried the methods mentioned in this post, but they didn't help.

Edit: The PostgreSQL version is 13.6, running in an Aurora cluster.

thomas
  • I would query [pg_locks](https://www.postgresql.org/docs/current/view-pg-locks.html). – Adrian Klaver Apr 10 '22 at 17:11
  • I tried running the query: "SELECT query, backend_xmin, state, locktype, mode FROM pg_locks JOIN pg_stat_activity USING (pid) WHERE relation::regclass = 'table_xxx'::regclass AND granted IS TRUE AND backend_xmin IS NOT NULL;" But it returns an empty result. – thomas Apr 10 '22 at 17:20
  • Try a plain `select * from pg_locks` to see if there are any locks. – Adrian Klaver Apr 10 '22 at 17:28
  • What version of Postgres is this? Add as update to question. I ask as the latest releases have this: "Allow parallel vacuuming and concurrent index building to be ignored while computing oldest xmin. Non-parallelized instances of these operations were already ignored, but the logic did not work for parallelized cases. Holding back the xmin horizon has undesirable effects such as delaying vacuum cleanup." – Adrian Klaver Apr 10 '22 at 17:36
  • The version is 13.6, running in Aurora cluster (Also added as update to the question). Does rebooting the db instance reset the xmin value? – thomas Apr 10 '22 at 17:53
  • 1) 13.6 has the fix assuming AWS just passes that along in its version. 2) Are there any locks? 3) I am not aware that a reboot changes the xmin value. What needs to be kept in mind is from here [Postgres Aurora](https://aws.amazon.com/rds/aurora/postgresql-features/): "The PostgreSQL-compatible edition of Aurora delivers up to 3X the throughput of standard PostgreSQL running on the same hardware, ..." . Not exactly sure what AWS does to Postgres when modifying it for Aurora. – Adrian Klaver Apr 10 '22 at 18:01

2 Answers


In addition to backend_xmin, also check backend_xid from pg_stat_activity. A transaction that has a transaction ID can also block VACUUM progress.
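
A query along these lines is one way to see both values side by side (the column list is just illustrative):

SELECT pid, state, backend_xid, backend_xmin, query
FROM pg_stat_activity
WHERE backend_xid IS NOT NULL
   OR backend_xmin IS NOT NULL
ORDER BY greatest(age(backend_xid), age(backend_xmin)) DESC;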

Apart from old transactions, there are some other things that can hold the “xmin horizon” back (see the example queries after this list):

  • stale replication slots (see pg_replication_slots)

  • abandoned prepared transactions (see pg_prepared_xacts)

  • a standby server with hot_standby_feedback = on that is running a long query (look at backend_xmin in pg_stat_replication)
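
As a sketch (the column choices are illustrative), these views can be checked like this:

SELECT slot_name, active, xmin, catalog_xmin, restart_lsn
FROM pg_replication_slots;

SELECT gid, prepared, owner, database
FROM pg_prepared_xacts
ORDER BY prepared;

SELECT application_name, backend_xmin, state
FROM pg_stat_replication;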

For more details, see my article on the topic.

Laurenz Albe
  • We faced a similar problem where VACUUM was not able to function properly, and the cause was a stale replication slot. – cvipul Apr 19 '23 at 21:19

A row is only completely dead when no live transaction can see it anymore, i.e. no transaction that started before the row was updated or deleted is still running. That does not necessarily involve any locks at all. The mere existence of a long-running transaction can block VACUUM from cleaning up.
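
A minimal sketch of that effect, using the question's table_xxx as a placeholder (the filter column is hypothetical):

-- Session 1: open a transaction that keeps its snapshot for its whole lifetime
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM table_xxx;
-- transaction left open ("idle in transaction")

-- Session 2: delete rows, then try to clean up
DELETE FROM table_xxx WHERE some_column = 'some_value';
VACUUM VERBOSE table_xxx;
-- reports "dead row versions cannot be removed yet" until session 1 commits or rolls back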

So the system view to consult is pg_stat_activity. Look for zombie transactions that you can kill; then VACUUM can proceed.
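
For example, something along these lines (the one-hour threshold and the pid are placeholders):

-- long-open transactions, whether active or idle in transaction
SELECT pid, state, xact_start, backend_xid, backend_xmin, query
FROM pg_stat_activity
WHERE xact_start < current_timestamp - interval '1 hour'
ORDER BY xact_start;

-- terminate a session that is clearly abandoned
SELECT pg_terminate_backend(12345);  -- 12345 is a placeholder pid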

Old prepared transactions can also block for the same reason. You can check pg_prepared_xacts for those.
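
For example (the gid is a placeholder):

-- list prepared transactions, then remove an abandoned one by its gid
SELECT gid, prepared, owner FROM pg_prepared_xacts;
ROLLBACK PREPARED 'some_gid';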

Of course, VACUUM only runs on the primary server, not on replica (standby) instances, in case streaming replication has been set up.


Erwin Brandstetter
  • I did look at pg_stat_activity for any idle transactions that had been running for a long time, but we could not see any. However, we tried rebooting the DB instance, and after that VACUUM successfully removed the dead tuples. – thomas Apr 10 '22 at 18:53
  • 1
    @thomas: A reboot (restart) is a brute-force catch-all. But you'll never know what actually blocked now .... – Erwin Brandstetter Apr 10 '22 at 19:00
  • Yes, that is what I am still worried about. Since I am still unsure of the actual cause, the same issue may occur in the future, maybe with more damage. But as this issue was causing downtime in our live applications, I went with the restart approach. I will try to reproduce the issue on our staging/dev servers to find the root cause. – thomas Apr 10 '22 at 19:05
  • 2
    @thomas next time, you want to look at backend_xid, not backend_xmin from pg_stat_activity. The first one might tell you who is causing the problem, while the second is just who is a victim of the problem. – jjanes Apr 11 '22 at 00:42