I am using Apache Airflow to implement an ETL pipeline.
I have to retrieve data from an API endpoint, and these are my constraints:
1) The endpoint is heavily rate-limited: I can make at most one request per minute.
2) The data is paginated, and I can't know the number of pages beforehand. Each request returns a single page, so I am forced to wait a minute before fetching the next page and joining its data with what I already have (see the sketch after this list).
3) I have to repeat this whole operation many times to get data about different entities, changing the request parameters each time.
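To make this concrete, here is a simplified sketch of the fetching logic. The endpoint URL, the parameter names, and the pagination field are placeholders, since I can't share the real API:

```python
import time

import requests

def fetch_all_pages(entity_id):
    """Fetch every page for one entity, sleeping between requests to
    respect the one-request-per-minute limit."""
    url = "https://api.example.com/data"  # placeholder endpoint
    results = []
    page = 1
    while True:
        resp = requests.get(url, params={"entity": entity_id, "page": page})
        resp.raise_for_status()
        body = resp.json()
        results.extend(body["items"])    # hypothetical payload field
        if not body.get("has_next"):     # page count unknown upfront,
            break                        # so stop when the API says so
        page += 1
        time.sleep(60)                   # at most one request per minute
    return results
```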
Currently I have a DAG scheduled to run every minute; each run has to download all the pages and save the complete result to a database. The problem is that by scheduling a new run every minute, if the previous run had to request multiple pages, I will exceed the API's rate limit.
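My DAG currently looks roughly like this (simplified, with placeholder names, assuming the Airflow 2 style DAG API):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def download_and_save():
    # Placeholder for the real work: fetch all pages for each entity
    # (as in the sketch above) and save the combined result to the database.
    ...

with DAG(
    dag_id="fetch_api_data",                 # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(minutes=1),  # a new run every minute
    catchup=False,
) as dag:
    PythonOperator(
        task_id="download_all_pages",
        python_callable=download_and_save,
    )
```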
The best solution would be to not schedule a new run of that DAG until the previous run has finished. Is that possible?
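For what it's worth, I came across the max_active_runs DAG argument, which I believe caps the number of DAG runs that can be in progress at once. Would setting it like this solve my problem, or is there a better pattern?

```python
from datetime import datetime, timedelta

from airflow import DAG

with DAG(
    dag_id="fetch_api_data",
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(minutes=1),
    catchup=False,
    max_active_runs=1,  # if I understand correctly, at most one run
                        # of this DAG can be in progress at a time
) as dag:
    ...
```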