
I am running some automated tasks on my Postgres database at night using the pg_cron extension, moving certain old records to archive tables. I run 5 stored procedures concurrently on 5 different background workers, so they all start at the same time and run on different workers (I assume this is similar to running different Tasks on different Threads in Java). These 5 stored procedures are independent (each moves records to its own archive tables), so they can run at the same time. I schedule each of them with a command like

SELECT cron.schedule('myJob1',
    '* * * * *',
    'CALL my_stored_proc_1()'
);

SELECT cron.schedule('myJob2',
    '* * * * *',
    'CALL my_stored_proc_2()'
);

...

SELECT cron.schedule('myJob5',
    '* * * * *',
    'CALL my_stored_proc_5()'
);

NOW, I have some MORE dependent stored procedures that I want to run. But they need to run AFTER these 5 jobs finish/complete, because they perform some DELETE ... SQL operations.

How can I have this second stored procedure job (the one doing the DELETE queries) run AFTER my first 5 stored procedure jobs are DONE? I don't want to set a cron expression for the second stored procedure doing the DELETEs, because I don't know what time the first 5 stored procs are even going to finish...

Below I included a little schematic of how the jobs are currently triggered and how I want it to work (if possible):

ennth
  • I would make the cron functions data-dependent. The function starts, checks the conditions for whether to run, exits if the conditions are not met, and executes if they are. This way it's possible to implement dependency logic of any complexity. You can check the condition of the server: is it loaded or not, are there locks, are there other processes running on the server. The cron schedule itself becomes less relevant. Create logs for those tasks. It's sad that there are no autonomous transactions in PostgreSQL, exactly for logs. – Alex Yu Jul 31 '21 at 03:37
  • But how would I do that for the SECOND cron job I want to RUN after the first 5 that run concurrently??? I have the cron set up to run at 1am every night... so even if I add a conditional statement to the SECOND cron job, such that it won't run until the first 5 are done... how will it even re-trigger itself??? – ennth Jul 31 '21 at 03:41
  • Aha. Let's say each of the first 5 tasks must write a record to a log: `{log_id, task_type, start_time, finish_time}`. Then the SECOND cron job on start checks that there exist exactly 5 records showing that all 5 tasks did their work. While it does not see that, it exits; when it sees that all previous tasks finished their work, it starts work – Alex Yu Jul 31 '21 at 03:46
  • So it's like `SELECT [condition 1] AND [condition 2] AND [condition 3] .. [condition 5] AS do_work` at the start of function 6. And then if `do_work` is true, you start execution – Alex Yu Jul 31 '21 at 03:48

2 Answers


Preface: how I understand the problem

I hope that I've understood the problem described by the OP.

If I haven't, everything below is invalid.

I suppose that it's about periodic night tasks that are heavy on CPU and/or IO.

E.g:

  • there are tasks A-C for archiving data
  • maybe tasks D-E for rebuilding aggregates / refreshing materialized views
  • and finally task F that runs reindexing/analyze on the whole DB

So it makes sense to run task F only after tasks A-E have finished.

Every task needs to run just once in a given period:

  • once a day, hour, or week, or only on weekend nights
  • preferably not while the server is under load

Whether this fits the OP's requirement, I don't know.

For the sake of simplicity let's presume that each task runs only once per night. It's easy to extend to other periods/requirements.

Data-driven approach

1. Add log table

E.g.

CREATE TABLE job_log (
  log_id   bigint PRIMARY KEY,
  job_name text,
  log_date timestamptz
);
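The start-of-job checks below filter on `job_name` plus the date, so a small index (the name here is my choice) keeps those lookups cheap as the log grows:

```sql
-- hypothetical index supporting the nightly "did this job already run?" checks
CREATE INDEX job_log_name_date_idx ON job_log (job_name, log_date);
```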

Tasks A-E

On start

In each job function, first do this check:

IF EXISTS(
  SELECT 1 FROM job_log 
    WHERE
      job_name = 'TaskA' -- TaskB-TaskE in the other functions
      AND log_date::DATE = NOW()::DATE -- already executed this night?
) OR EXISTS(
   SELECT 1 FROM pg_stat_activity 
     WHERE 
       query LIKE 'SELECT * FROM jobA_function();' -- already executing right now?
) THEN RETURN;
END IF;

It's possible that other conditions could be added: look for amount of connections, existence of locks and so on.

This way it's guaranteed that the function will not execute more frequently than needed.

On finish

INSERT INTO job_log
SELECT
   COALESCE((SELECT MAX(log_id) FROM job_log), 0) + 1 -- or use a sequence/identity column
  ,'TaskA'
  ,NOW();
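Put together, a full wrapper for TaskA might look like this sketch (the names `taskA_function` and `taskA_work` are my inventions; adjust the `query` filter to match however cron actually invokes the function):

```sql
CREATE OR REPLACE FUNCTION taskA_function() RETURNS void AS $$
BEGIN
    -- exit if TaskA already ran tonight...
    IF EXISTS (
        SELECT 1 FROM job_log
        WHERE job_name = 'TaskA'
          AND log_date::date = now()::date
    )
    -- ...or if another session is executing it right now
    OR EXISTS (
        SELECT 1 FROM pg_stat_activity
        WHERE query LIKE '%taskA_function()%'
          AND pid <> pg_backend_pid()  -- ignore our own query
    ) THEN
        RETURN;
    END IF;

    PERFORM taskA_work();  -- hypothetical: the actual archiving work

    -- log completion so tonight's later attempts become no-ops
    INSERT INTO job_log (log_id, job_name, log_date)
    SELECT coalesce(max(log_id), 0) + 1, 'TaskA', now() FROM job_log;
END;
$$ LANGUAGE plpgsql;
```

The `pid <> pg_backend_pid()` filter matters: the function's own invocation also shows up in pg_stat_activity, and without it the check would always see "itself" running.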

Cronjob schedule

Its meaning becomes different.

Now it's: "try to initiate execution of the task".

It's safe to schedule it for every hour within a chosen window, or even more frequently.

The cron job cannot know whether the server is under load, whether there are locks on a table, or whether somebody started the task manually.

The job function can be smarter about that.

Task F

Same as above, but the check on start also looks for completion of the other tasks.

E.g.

IF NOT EXISTS(
  SELECT 1 FROM job_log 
     WHERE 
       job_name = 'TaskA'
       AND log_date::DATE = NOW()::DATE
) OR NOT EXISTS(  
  SELECT 1 FROM job_log 
     WHERE 
       job_name = 'TaskB'  
       AND log_date::DATE = NOW()::DATE
)
-- ... the same NOT EXISTS checks for TaskC-TaskE
OR EXISTS(
  SELECT 1 FROM job_log 
    WHERE
      job_name = 'TaskF'
      AND log_date::DATE = NOW()::DATE -- TaskF already executed this night?
) OR EXISTS(
   SELECT 1 FROM pg_stat_activity 
     WHERE 
       query LIKE 'SELECT * FROM jobF_function();' -- TaskF already executing right now?
) THEN RETURN;
END IF;

On completion

Write to job_log the same as other functions.

UPDATE. Cronjob schedule

Create an additional, more frequent schedule in cron.

E.g.

Let's say tasks A-E usually run for approximately 10-15 minutes.

It's also possible that one or two of them will occasionally take 30-60 minutes.

Create a schedule for task F that attempts to start it every 5 minutes.

How that will work:

  • attempt 1: task A finished, others still working -> exit
  • attempt 2: tasks A-C finished -> exit
  • attempt 3: tasks A-E finished -> start task F
  • attempt 4: tasks A-E finished but pg_stat_activity shows an executing task F -> exit
  • attempt 5: tasks A-E finished, pg_stat_activity is empty, but the log shows task F already executed -> no need to work -> exit
  • ... all further attempts behave the same until the next night
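With pg_cron, this retry loop is just a frequent schedule; the job name, night window, and function name below are my assumptions:

```sql
-- try Task F every 5 minutes between 01:00 and 06:59;
-- the function itself decides whether to actually do the work
SELECT cron.schedule('tryTaskF', '*/5 1-6 * * *', 'SELECT taskF_function();');
```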

Summary

It's easy to extend this approach to any requirements:

  • another periodicity
  • or no periodicity at all, e.g. a table with a trigger that starts execution on change
  • dependencies of any depth and/or "fuzzy" dependencies
  • ... literally everything

Conception remains the same:

  • cronjob schedule means "try to run"
  • decision to run or not is data-driven

I would be glad to hear criticism of any kind - who knows maybe I'm overlooking something.

Alex Yu
  • This looks good, I think it is similar to my problem. But for TASK F, what if the logs don't exist? Then that means the jobs didn't finish, and task F cannot START. That's expected behavior, but how can Task F know when to retry? And I can't have Task F retry many times – ennth Aug 01 '21 at 23:32
  • It's absolutely OK to have many retries in cron. The cron schedule is a schedule of attempts to start the job; whether it will really start is for the job function to decide. And note that the check in `pg_stat_activity` is crucial, so that TaskF will not start multiple times. – Alex Yu Aug 02 '21 at 03:07
  • @ennth take a look at update of answer. I clarified scheduling of Task F – Alex Yu Aug 02 '21 at 03:23
  • For TaskF, you don't have to check pg_stat_activity because pg_cron will NOT restart/start the job if it's currently executing anyway, right??? – ennth Aug 03 '21 at 20:11
  • @ennth Yes. But what if somebody started it manually? It's better to check then not. What would be if you decide to run it from other scheduler? – Alex Yu Aug 03 '21 at 23:03
  • that makes sense. And just to make sure, Alex, this only works if the CRON expression for TaskF is something recurring like '* 1 * * *' for example, to run every minute? Because if I set a CRON for TaskF at 1AM, and Tasks A-E took 2 hours, EVEN with the log table and the existence check, TaskF would be outside that window (since Tasks A-E took 2 hours) and would never run. So TaskF needs some cron to check constantly, right? – ennth Aug 04 '21 at 16:54
  • I also think it's more efficient to simply query the cron.job_run_details table and check the job by ID; it has a status SUCCEEDED/FAILED/EXECUTING, which is better than the pg_stat_activity table – ennth Aug 04 '21 at 20:43
  • Hm. Look at it from this side: A. `pg_stat_activity` is not a table, it's an internal PostgreSQL structure with a table interface. That means it's not transactional: you see the real picture of what's going on on the server. B. `cron.job_run_details` is a normal table, and in PostgreSQL there are no autonomous transactions (as in Oracle) and no dirty reads, which means its content is less credible as a view of real-time state. Overall: if you see that `pg_cron` fits your needs, go for it. I once hit the limitations of SQL Server Agent; be prepared for something similar. – Alex Yu Aug 04 '21 at 22:11
  • one problem I noticed is that the FUNCTIONS or PROCEDURES I am using run on a separate database (not the default 'postgres' db where the pg_cron schema is), so you can't query any cron tables from my functions/procs when pg_cron starts their JOB; the cron job runs from the postgres db – ennth Aug 05 '21 at 11:31

You could use the pg_stat_activity view to ensure that there are no active queries like your jobs 1-5.

Note:

Superusers and members of the built-in role pg_read_all_stats (see also Section 21.5) can see all the information about all sessions

...
while (
    select count(*) > 0 
    from pg_stat_activity 
    where query in ('call my_stored_proc_1()', 'call my_stored_proc_2()', ...))
loop
    perform pg_sleep(1);
    perform pg_stat_clear_snapshot(); -- needs to retrieve the fresh data
end loop;
...

Just insert this code at the beginning of your stored proc 6 and schedule it to start a few seconds after jobs 1-5.
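For completeness, a sketch of how the loop could sit inside the procedure (`do_deletes()` is a placeholder for the actual DELETE logic):

```sql
CREATE OR REPLACE PROCEDURE my_stored_proc_6() AS $$
BEGIN
    while (
        select count(*) > 0
        from pg_stat_activity
        where query in ('call my_stored_proc_1()', 'call my_stored_proc_2()',
                        'call my_stored_proc_3()', 'call my_stored_proc_4()',
                        'call my_stored_proc_5()'))
    loop
        perform pg_sleep(1);
        perform pg_stat_clear_snapshot(); -- needed to see fresh activity data
    end loop;

    -- none of the five jobs are still running: safe to delete
    CALL do_deletes(); -- hypothetical
END;
$$ LANGUAGE plpgsql;
```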

Note 1:

The condition can be simplified and generalized using a regexp:

where query ~ 'my_stored_proc_1|my_stored_proc_2|...'

Note 2:

You could implement a timeout using the clock_timestamp() function:

...
is_timedout := false;
timeout := '10 min'::interval; -- stop waiting after 10 minutes
start_time := clock_timestamp();
while (...)
loop
    perform pg_sleep(1);
    perform pg_stat_clear_snapshot(); -- needs to retrieve the fresh data
    if clock_timestamp() - start_time > timeout then
        is_timedout := true;
        exit; -- plpgsql uses EXIT rather than BREAK
    end if;
end loop;

if is_timedout then
    ...
else
    ...
end if;
...

Note 3:

Look at the other columns of the pg_stat_activity. You may need to use them as well.

Abelisto
  • I see that you propose a solution somewhat similar to mine. The core is using the non-transactional (non-MVCC) behaviour of `pg_stat_activity` – Alex Yu Sep 09 '21 at 05:58