
Our database seems to be broken. Normally it uses about 1-2% of CPU, but if we run some additional backend services making UPDATE and INSERT queries against a 10M-row table (about 1 query per 3 seconds), everything goes to hell, including a CPU increase from 2% to 98% usage.

We decided to debug what's going on and ran VACUUM and ANALYZE to learn what's wrong with the DB, but...

production=# ANALYZE VERBOSE users_user;
INFO:  analyzing "public.users_user"
INFO:  "users_user": scanned 280 of 280 pages, containing 23889 live rows and 57 dead rows; 23889 rows in sample, 23889 estimated total rows
INFO:  analyzing "public.users_user"
INFO:  "users_user": scanned 280 of 280 pages, containing 23889 live rows and 57 dead rows; 23889 rows in sample, 23889 estimated total rows
ERROR:  tuple already updated by self

We are not able to finish ANALYZE on ANY of the tables and could not find any information about this issue. Any suggestions as to what could be wrong?

 PostgreSQL 9.6.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16), 64-bit

Additional info as requested in comments:

Maybe you have a corrupted pg_class

SELECT * FROM pg_class WHERE relname = 'users_user';

Output: https://pastebin.com/WhmkH34U

So the first thing to do would be to kick out all other sessions and try again

There are no additional sessions. We have dumped the whole DB onto a new testing server, and the issue still occurs; there are no clients connected to this DB.
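
For reference, a quick way to verify that no other sessions are connected (this assumes PostgreSQL 9.6, where pg_stat_activity exposes pid, state and query):

SELECT pid, usename, state, query
FROM pg_stat_activity
WHERE pid <> pg_backend_pid();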

Damian Gądziak
  • That sounds like a catalog corruption. I recommend doing a `VACUUM FULL ANALYZE pg_statistic`, though you should know that will lock up the entire database for the duration of the operation. If that doesn't work, start doing the same for all of the tables in `\dt pg_catalog.*`. – jmelesky May 21 '18 at 14:04
  • `/* * The tuple has already been locked by our own transaction. This is * very rare but can happen if multiple transactions are trying to * lock an ancient version of the same tuple. */ return HeapTupleSelfUpdated;` in source file backend/access/heap/heapam.c, line 5588 – wildplasser May 21 '18 at 14:15
  • So the first thing to do would be to kick out all other sessions and try again ,IMHO. – wildplasser May 21 '18 at 14:17
  • `10M-row table (about 1 query per 3 seconds), everything goes to hell, including a CPU increase from 2% to 98% usage.` BTW: this is quite common (for instance, if there are no indexes supporting FK constraints) – wildplasser May 21 '18 at 14:20
  • The strange thing here is the duplicate `INFO`, which might explain the error (updating the same `pg_statistic` row twice). Maybe you have a corrupted `pg_class`? What do you get for `SELECT * FROM pg_class WHERE relname = 'users_user';`? – Laurenz Albe May 22 '18 at 07:33
  • @LaurenzAlbe here is the output https://pastebin.com/WhmkH34U – Damian Gądziak May 22 '18 at 10:13
  • @jmelesky your query took 0.0 seconds on pg_statistic; it seems there wasn't anything to scan. I have run it for all 69 tables as well and it's still the same... btw, doing a VACUUM FULL ANALYZE for users_user did not raise the "tuple already updated by self" error, but running a plain ANALYZE does – Damian Gądziak May 22 '18 at 10:46
  • please `SELECT ctid,xmax,xmin,cmax, cmin,oid::regclass FROM pg_class WHERE relname = 'users_user'` – Vao Tsun May 24 '18 at 16:37
  • @VaoTsun ctid | (16,44) xmax | 0 xmin | 3844 cmax | 4 cmin | 4 oid | users_user – Damian Gądziak May 27 '18 at 17:54
  • this is odd: @LaurenzAlbe's query gives two rows, and mine just one?.. – Vao Tsun May 27 '18 at 18:24
  • @VaoTsun once again both queries on production to compare https://pastebin.com/H9z8f0jF – Damian Gądziak May 28 '18 at 17:33
  • I would start the database in single user mode and reindex `pg_class`. See if that makes the problem go away. – Laurenz Albe May 29 '18 at 11:35 (see the sketch after the comments)
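
A minimal sketch of the single-user-mode reindex suggested in the last comment (the data directory path is an assumption; substitute your own). Stop the regular server first; postgres --single reads commands from the terminal and exits on EOF:

postgres --single -D /var/lib/pgsql/9.6/data production
backend> REINDEX TABLE pg_class;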

1 Answer


I'd recommend starting the server with the following parameters before searching for duplicated rows:

enable_indexscan = off
enable_bitmapscan = off
ignore_system_indexes = on
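
One way to pass these for a single run, without editing postgresql.conf, is on the server command line (a sketch; the data directory path is an assumption):

pg_ctl start -D /var/lib/pgsql/9.6/data \
  -o "-c enable_indexscan=off -c enable_bitmapscan=off -c ignore_system_indexes=on"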

If your server crashed, the indexes could be inconsistent with the table data. This can happen when corruption affects transaction visibility (pg_clog), for example. Then search for a duplicated row in pg_class or pg_statistic, as mentioned in the comments.
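
For the duplicate search itself, a sketch that checks the unique keys of both catalogs ((relname, relnamespace) for pg_class; (starelid, staattnum, stainherit) for pg_statistic):

SELECT relname, relnamespace, count(*)
FROM pg_class
GROUP BY relname, relnamespace
HAVING count(*) > 1;

SELECT starelid::regclass, staattnum, stainherit, count(*)
FROM pg_statistic
GROUP BY starelid, staattnum, stainherit
HAVING count(*) > 1;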

You could also try to clean up pg_statistic. First, start the server with:

allow_system_table_mods = on
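
In 9.6, allow_system_table_mods can only be set at server start, so it has to go on the command line or in postgresql.conf; for example (data directory path assumed):

pg_ctl start -D /var/lib/pgsql/9.6/data -o "-c allow_system_table_mods=on"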

Then issue a TRUNCATE TABLE, followed by ANALYZE:

--Cleaning pg_statistic
TRUNCATE TABLE pg_catalog.pg_statistic;
--Analyze entire database
ANALYZE VERBOSE;

If the problem is in pg_statistic, this should be enough.

Michel Milezzi