
I am using Apache Airflow to implement an ETL pipeline.

Basically, I have to retrieve data from an API endpoint, and this is my problem:

1) The endpoint is rate-limited: I can make at most one request per minute.

2) The data is divided into pages, and I can't know how many there are beforehand. A request returns only one page, so I have to wait a minute before fetching the second page and joining its data with the data from the first page.

3) I have to repeat this operation many times to get data about different entities, changing the request parameters each time.

Currently I have a DAG that is scheduled to run every minute; it has to download all the pages and save the complete result into a database. The problem is that, by scheduling a new run every minute, if the previous run had to request multiple pages I will exceed the API's rate limit.
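Roughly, the DAG looks like this (a minimal sketch; the dag_id, the callable, and its contents are just placeholders):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def fetch_and_store():
    # Placeholder: request the pages one per minute, join them,
    # and save the complete result into the database.
    ...

dag = DAG(
    dag_id="fetch_api_data",
    start_date=datetime(2020, 5, 1),
    schedule_interval=timedelta(minutes=1),  # a new run every minute
    catchup=False,
)

fetch_task = PythonOperator(
    task_id="fetch_and_store",
    python_callable=fetch_and_store,
    dag=dag,
)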

The best solution would be not to schedule a new run of that DAG until the previous run has finished. Is that possible?

2 Answers


You can achieve this by setting max_active_runs = 1 when creating your DAG.

See the following related question: How to limit Airflow to run only 1 DAG run at a time?

If the current run is not yet complete, the new run will begin once the current one finishes.
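For example, a minimal sketch (the dag_id, dates, and schedule are only illustrative):

from datetime import datetime, timedelta

from airflow import DAG

dag = DAG(
    dag_id="fetch_api_data",
    start_date=datetime(2020, 5, 1),
    schedule_interval=timedelta(minutes=1),
    max_active_runs=1,  # at most one active run of this DAG at any time
    catchup=False,
)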

Kurt Kline
  • Thank you for your response. I know of this option, but I was wondering if something more elegant existed, since if I do as you suggest I would have a large number of runs waiting to be executed (growing non-stop). Furthermore, the execution of the successive run would start as soon as the previous one finished, without waiting for a minute, and would therefore exceed the API's limits. –  May 12 '20 at 19:47

When the program starts, it checks whether a lock file exists.
If the lock file exists, the program exits immediately.
If the lock file doesn't exist, the program creates the lock file, performs the work, and removes the lock file.

Or, roughly, in Python:

import os

lock_file = "/tmp/etl.lock"  # example path; any location visible to all runs

if not os.path.exists(lock_file):
    open(lock_file, "w").close()  # create the lock file
    do_whatever_needed()          # perform the work
    os.remove(lock_file)          # remove the lock file

With this approach you can keep the current scheduling but two processes will never run simultaneously.

dlask
  • Thank you for your response. For now I have implemented a similar system using Airflow's Variables. However, I was wondering if something more elegant existed, since with this method I will end up with a high number of useless runs polluting my Airflow statistics. –  May 12 '20 at 19:51
  • Either you use regular scheduling, where many executions are possibly useless, or you keep the application always running and doing everything needed by itself. – dlask May 12 '20 at 19:55