
I have a cronjob that is kicked off every night that involves building up a temporary table, dropping the current table on Redshift, and swapping in the temporary table for the old one. More than half of the time, this specific job gets stuck when dropping the existing table, behaving as if there is some pending transaction that is stopping the drop from going through.
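
In case it helps, the swap step looks roughly like this (table names are placeholders; the real script differs in detail):

```sql
-- Simplified sketch of the nightly build-and-swap; actual names differ.
CREATE TABLE report_staging (LIKE report);

-- ... populate report_staging via COPY / INSERT ...

BEGIN;
DROP TABLE report;                            -- this is the step that hangs
ALTER TABLE report_staging RENAME TO report;
COMMIT;
```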

This is just one of dozens of jobs that use the exact same script to run overnight, none of which have ever had this issue; however, there are a few minor differences:

  • The box that this particular job is running on is a different box from all of the other production jobs, as this one is currently in a testing state.
  • The S3 key used on this box is different from the other box.

In addition to the fact that I have never seen this on any other job, this issue has been extremely difficult to troubleshoot for the following reasons:

  • I have not been able to replicate this issue by running the script manually on the same box it is currently being run on; the script executes as expected, with the table drop occurring in mere seconds. The only difference I can think of here is that I'm executing the script as ubuntu whereas the cronjob is executed from root.
  • I have not had any success identifying or terminating the sessions that are causing the drop to stall. I've looked high and low on Stack Overflow (the most applicable question with answers is redshift drop or truncate table very very slow), the Redshift docs, and elsewhere, but nothing I've found has been the answer. When the job is stalled, I've checked the following tables on Redshift (the queries I've been running are sketched after this list) and usually find things in the following state:
    • The temporary table has been created, but the old version of the destination table is still there.
    • The stv_locks table shows that there are three processes holding locks, with lock_status values of "Holding write lock," "Holding delete lock," and "Holding insert lock" respectively. The process ID associated with these is NOT the ID related to the current job.
    • The stv_tr_conflict table shows nothing.
    • The stv_recents table shows the drop with a status of Running.
    • The query that's supposedly creating the lock described above shows up in the svl_qlog as finished, so that seems to contradict the stv_locks table.
    • Using pg_terminate_backend to stop the associated process does not actually remove the session when querying stv_sessions, but DOES free up something that allows the job to finish.
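
For reference, these are the sorts of queries I've been running to inspect the stalled state (the pid passed to pg_terminate_backend is a placeholder):

```sql
-- Who is holding locks on which tables, and since when?
SELECT table_id, last_update, lock_owner_pid, lock_status
FROM stv_locks
ORDER BY last_update;

-- What does Redshift think is currently running?
SELECT pid, user_name, starttime, query
FROM stv_recents
WHERE status = 'Running';

-- Did the supposedly lock-holding query actually finish?
SELECT query, pid, starttime, endtime, substring
FROM svl_qlog
ORDER BY starttime DESC
LIMIT 20;

-- Check whether the session is still around.
SELECT process, user_name, starttime
FROM stv_sessions;

-- Terminate the session holding the lock (12345 is a placeholder pid).
SELECT pg_terminate_backend(12345);
```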

Any help in figuring out what exactly is going on here would be greatly appreciated!

  • It would not be related to where the SQL commands are originating. I'm not sure what is meant by "The S3 key used on this box", but if that is referring to the credentials used in a `COPY` command, that should not impact the request -- it either works or doesn't work. I would recommend using **AWS Support** to assist in diagnosing the issue. A subscription to Support can pay for itself in time saved! – John Rotenstein Apr 13 '16 at 21:26
  • Sounds like you do have a lock somewhere on the schema, which could have been created by another connection/transaction, and it prevents other commands from executing. If you run scripts in parallel, or create new connections for every command, I'd recommend combining them to make sure everything runs serially and/or uses the same connection. – denismo Apr 14 '16 at 05:41

1 Answer


I faced the same problem; I just rebooted the Redshift cluster and then it worked normally again.

Obadah Meslmani
  • Interesting, but it worked for me. Unfortunately, if you are running on a production server this reboot will not be easy to do. – Eralper Aug 20 '19 at 15:49