
I have a data product that receives realtime streams of vehicle data, processes the information, and then displays it through REST APIs. I am looking to rebuild the ETL side of the system to improve reliability and architectural order. I am considering Apache Airflow, but have some doubts.

The Python microservices I need to orchestrate are complex and have many different dependencies, so a monolithic solution built from Python operators in Airflow would be huge and unwieldy. Do you see any convenient way of using Airflow for these needs? Would Kafka or some other messaging system be a better solution?

Thanks

Meghdeep Ray
  • I can't predict the *latency tolerance* of your task at hand, but do understand that Airflow is a replacement for cron, not a *realtime scheduler / orchestrator* (like the one your OS has). Read the last para [here](https://stackoverflow.com/a/55132959/3679900). Airflow tasks easily exhibit [scheduling latencies](https://www.google.com/search?q=airflow+scheduling+latency) reaching up to 30 seconds (I've even seen minutes). At first look, event sourcing (as you already suggested) appears to be the better solution, though capturing dependencies there would be a challenge – y2k-shubham Jun 12 '20 at 18:30
  • also have a look at [Cadence](https://cadenceworkflow.io/docs/) – y2k-shubham Jun 12 '20 at 19:43

1 Answer


Airflow is definitely used to orchestrate ETL systems; however, if you require real-time ETL, a message queue is probably better.

However, you can trigger DAGs externally if that suits your use case. Depending on how distributed you want the system to be, you can also run multiple instances to parallelize and speed up your operations.
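As a sketch of external triggering, Airflow 2's stable REST API exposes a `POST /api/v1/dags/{dag_id}/dagRuns` endpoint. The base URL, the `vehicle_etl` DAG id, and the `conf` payload below are hypothetical, and authentication is omitted for brevity:

```python
import json
import urllib.request


def build_trigger_request(base_url: str, dag_id: str, conf: dict) -> urllib.request.Request:
    """Build a POST request for Airflow 2's stable REST API dagRuns endpoint."""
    url = f"{base_url}/api/v1/dags/{dag_id}/dagRuns"
    payload = json.dumps({"conf": conf}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example: trigger a run of a hypothetical "vehicle_etl" DAG with some context.
req = build_trigger_request("http://localhost:8080", "vehicle_etl", {"batch_id": "abc"})
print(req.full_url)  # http://localhost:8080/api/v1/dags/vehicle_etl/dagRuns
```

You would send the request with `urllib.request.urlopen(req)` (or the `airflow dags trigger` CLI) from whatever event source detects new data.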

If you think batch processing will be helpful, you can have a pseudo-realtime setup with a very fast schedule. This will (in my opinion) be easier to debug than a traditional message queue system.
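As a rough back-of-envelope illustration (the 30-second figure comes from the scheduling-latency comment on the question; the 1-minute interval is a hypothetical choice), the worst-case staleness of such a pseudo-realtime setup is about one schedule interval plus the scheduler latency:

```python
from datetime import timedelta

# Hypothetical numbers: a 1-minute schedule, and up to 30 s of
# Airflow scheduling latency (see the comment on the question).
schedule_interval = timedelta(minutes=1)
scheduler_latency = timedelta(seconds=30)

# Data arriving just after a run starts waits a full interval for the
# next run, plus however long the scheduler takes to actually start it.
worst_case_staleness = schedule_interval + scheduler_latency
print(worst_case_staleness.total_seconds())  # 90.0
```

If your product can tolerate staleness on that order, the fast-schedule approach is viable; if not, a message queue is the safer bet.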

Airflow gives you a lot of modularity, which lets you break down a complex architecture and debug components more easily than with a messaging queue. It also supports automatic retries, and it has named queues that you can use when you need to route large data loads to a separate processing queue on a machine with more resources.
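A minimal DAG sketch tying these pieces together, assuming Airflow 2 with the CeleryExecutor (named queues only apply there); the DAG id, task names, and the `heavy` queue are hypothetical:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    ...  # pull the latest vehicle data


def transform_heavy():
    ...  # resource-intensive processing


with DAG(
    dag_id="vehicle_etl",
    start_date=datetime(2020, 6, 1),
    schedule_interval=timedelta(minutes=1),  # pseudo-realtime fast schedule
    catchup=False,                           # don't backfill missed intervals
    max_active_runs=1,
    default_args={"retries": 2, "retry_delay": timedelta(seconds=30)},  # automatic retries
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)

    # Route the expensive step to a Celery worker started with:
    #   airflow celery worker --queues heavy
    transform_task = PythonOperator(
        task_id="transform_heavy",
        python_callable=transform_heavy,
        queue="heavy",
    )

    extract_task >> transform_task
```

Keeping each microservice's step as its own task like this is what gives you the per-component retry and debugging granularity mentioned above.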

I have worked on a similar use case to yours, where we needed to process realtime data and display it via APIs, and we used a mixture of Airflow and messaging queues. In my experience it was always much easier to figure out where and why something went wrong with Airflow, because the Airflow UI makes getting a high-level overview incredibly efficient.

Meghdeep Ray