I have a cronjob that kicks off every night to build up a temporary table, drop the current table on Redshift, and swap the temporary table in for the old one. More than half of the time, this job gets stuck when dropping the existing table, behaving as if some pending transaction is preventing the drop from going through.
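For context, the nightly swap follows roughly this pattern (table names here are placeholders, not the job's actual identifiers):

```sql
-- Build the replacement data in a staging table (placeholder names)
CREATE TABLE my_table_staging (LIKE my_table);
INSERT INTO my_table_staging SELECT * FROM source_table;

-- Swap the staging table in for the old one
BEGIN;
DROP TABLE my_table;  -- this is the statement that stalls
ALTER TABLE my_table_staging RENAME TO my_table;
COMMIT;
```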
This is just one of dozens of jobs that use the exact same script to run overnight, none of which has ever had this issue; however, there are a few minor differences:
- The box that this particular job is running on is a different box from all of the other production jobs, as this one is currently in a testing state.
- The S3 key used on this box is different from the one used on the other box.
In addition to the fact that I have never seen this on any other job, this issue has been extremely difficult to troubleshoot for the following reasons:
- I have not been able to replicate this issue by running the script manually on the same box it is currently being run on; the script executes as expected, with the table drop occurring in mere seconds. The only difference I can think of here is that I'm executing the script as `ubuntu`, whereas the cronjob is executed as `root`.
- I have not had any success identifying or terminating the sessions that are causing the `drop` to stall; I've looked high and low on Stack Overflow (this is the most applicable question with answers: redshift drop or truncate table very very slow), the Redshift docs, and elsewhere, but nothing I've found has been the answer. When I see that the job is stalled, I've checked the following tables on Redshift and usually find things in the following state:
  - The temporary table has been created, but the old version of the destination table is still there.
  - The `stv_locks` table shows three processes running, with a `lock_status` of "Holding write lock," "Holding delete lock," and "Holding insert lock" respectively. The process ID associated with these is NOT the ID related to the current job.
  - The `stv_tr_conflict` table shows nothing.
  - The `stv_recents` table shows the `drop` with a status of `Running`.
  - The query that's supposedly creating the lock described above shows up in `svl_qlog` as finished, which seems to contradict the `stv_locks` table.
  - Using `pg_terminate_backend` to stop the associated process does not actually remove the session when querying `stv_sessions`, but DOES free up something that allows the job to finish.
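For reference, the checks above correspond to queries along these lines (a sketch; the PID value is illustrative, not from the actual job):

```sql
-- Which sessions are holding locks, and on which tables?
SELECT * FROM stv_locks;

-- Any serializable-isolation conflicts detected?
SELECT * FROM stv_tr_conflict;

-- Is the DROP still listed as running?
SELECT * FROM stv_recents WHERE status = 'Running';

-- Has the query that supposedly holds the lock already finished?
SELECT * FROM svl_qlog WHERE pid = 12345;  -- illustrative PID

-- Terminate the session that appears to hold the lock (illustrative PID)
SELECT pg_terminate_backend(12345);
```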
Any help in figuring out what exactly is going on here would be greatly appreciated!