8

Sporadically we get PG::UndefinedTable errors while using ActiveRecord. The association table name is some how corrupted and I quite often see Cancelled appended to the end of the table name.

E.g:

ActiveRecord::StatementInvalid: PG::UndefinedTable: ERROR:  relation "fooCancell" does not exist 
ActiveRecord::StatementInvalid: PG::UndefinedTable: ERROR:  relation "Cancelled" does not exist
ActiveRecord::StatementInvalid: PG::UndefinedTable: ERROR:  relation "barC" does not exist

In the example above, I have obfuscated the table name by using foo and bar.

We see this errors when the rails project is running inside Puma. Queue workers seems to be doing okay.

The tables in the error message doesn't correspond to real tables or models. It looks like the case of memory corruption. Has anyone seen such issues? If so how did you get around it?

puma.rb

on_worker_boot do
  ActiveRecord::Base.establish_connection
end

database.yml

production:
  url:  <%= ENV["DATABASE_URL"] %>
  pool: <%= ENV['DB_CONNECTION_POOL_SIZE'] || 5%>
  reaping_frequency: <%= ENV['DB_CONNECTION_REAPING_FREQUENCY'] || 10 %>
  prepared_statements: false
Harish Shetty
  • 64,083
  • 21
  • 152
  • 198
  • 3
    What is your question? – mrzasa Feb 15 '18 at 23:04
  • What happens when you dump your table structure? Does it looks right to you? – Taryn East Feb 15 '18 at 23:12
  • @mrzasa Thats a good question! I composed the question in my text editor and made a copy paste error. Now, the last para contains the question. – Harish Shetty Feb 15 '18 at 23:49
  • @TarynEast I have updated my question to add more context.. – Harish Shetty Feb 15 '18 at 23:50
  • This looks like it could possibly be from setting up the connections incorrectly and falling victim to some sort of concurrency issue. Could you show us an example empty model if it has anything like `establish_connection` or similar in it? Also, please show us anything that is not in the `database.yml` which may be influencing your database access. – Cody Gustafson Feb 22 '18 at 19:13
  • @CodyGustafson updated the question.. – Harish Shetty Feb 22 '18 at 19:54
  • I don't see anything wrong with that. Do you have models that access more than one table? Or perhaps a gem used for sharding or something similar? – Cody Gustafson Feb 22 '18 at 20:36
  • @CodyGustafson We don't use sharding. We do use `composite_primary_keys` gem. Other than that there are no AR related gems in the project. – Harish Shetty Feb 22 '18 at 22:57
  • From the sounds of it, the table name cache might be getting corrupted. However, given that there are numerous ways of querying postgres directly with AR, it would really help to know how you're constructing the query--with raw SQL or using relations, and if possible post some simplified fragments of your code and or generated SQL statements. Mainly because with the exception being thrown from the underlying `pg` gem, all I'm seeing is that postgres can't find the table name it's being handed. – photoionized Feb 28 '18 at 18:18
  • @photoionized No raw queries. Every query is constructed using AR relations or scopes or AR `where` function. – Harish Shetty Feb 28 '18 at 19:54
  • Why is this marked `puma`? Is it limited to the `puma` server? You can test by installing `iodine` and running it with the same settings as puma (multi-threaded, clustered or both)... does the issue persist? I really doubt it's server related. – Myst Mar 01 '18 at 13:51
  • P.S. consider testing your app without clustering (single worker). – Myst Mar 01 '18 at 15:04

2 Answers2

1

It looks like reaping_frequency may be the issue. I found a couple claims that they may have a threading bug. I would try removing that option or setting it to nil and see if that works. The only other thing I can think of is if you are manually calling Thread.new and using active record within it. Here are the few claims against reaping:

http://omegadelta.net/2014/03/15/the-rails-grim-reaper/

https://github.com/mperham/sidekiq/issues/1936

Search for "DO fear the Reaper" here: https://www.google.com/amp/s/bibwild.wordpress.com/2014/07/17/activerecord-concurrency-in-rails4-avoid-leaked-connections/amp/

Cody Gustafson
  • 1,440
  • 11
  • 13
  • I doubt a leaking connection would cause corruption... when connections leak, they aren't accessible, causing "starvation" (application hangs waiting for a connection to become available). They don't cause internal corruption within a connection. – Myst Mar 01 '18 at 14:12
  • Yeah, leaking connections itself wouldn't be the issue and my answer _may_ be a stretch. I focused more where I believed there could be thread-unsafe variables which could cause unexpected issues. I like your thoughts on `fork` being the issue. That would makes a lot of sense. – Cody Gustafson Mar 01 '18 at 14:34
1

I'm hazarding a guess here, based on this possibly related error...

But you might be either:

  1. calling fork within your application; OR
  2. calling ActiveRecord routines (using database calls) before the server (puma) is forking it's worker processes (during the app initialization).

Either of these will break ActiveRecord's synchronization and cause multiple processes to share the database connection pool without synchronizing it's use (resulting in interlaced and corrupt database commands).

If you are using fork, make sure to close all the ActiveRecord database connections and reinitialize the connection pool (there's a function call that does it, but I don't remember it of the top of my head, maybe ActiveRecord.disconnect! or ActiveRecord.connection_pool.disconnect!).

Otherwise, before running Puma (either during the initialization process or using Puma's after_fork), close all the ActiveRecord database connections and reinitialize the connection pool.

Myst
  • 18,516
  • 2
  • 45
  • 67
  • We don't manually `fork` in our application. But I do suspect it might be something to do with puma forks. We access the database in few initializers, I am not sure if that can cause corruption.. – Harish Shetty Mar 01 '18 at 19:49
  • @HarishShetty - yes, accessing the database in an initializer will definitely cause this same issue. The code would have to either delay the initialization, or close the connection in each initializer (once you're done with the object) or close all connections in Puma's `after_fork` callback. Otherwise, the connections will remain opened and they will be shared among the processes without being properly synchronized. – Myst Mar 01 '18 at 19:53
  • It looks like Puma recommends disconnecting in `before_fork`(Look at this discussion: https://github.com/puma/puma/issues/1001). Also, puma_heroku made a change to adhere to this behavior: https://github.com/puma/puma-heroku/pull/2/files – Harish Shetty Mar 01 '18 at 21:12
  • I will run this fix in production for few days and see if I get the issue again.. – Harish Shetty Mar 01 '18 at 21:14
  • @HarishShetty I hope this fixes the issue. Good luck! and thank you for posting the links. I hop the information here will help future visitors as well. – Myst Mar 01 '18 at 21:16