3

I have a question regarding the differences between Apache Airflow and Metaflow (https://docs.metaflow.org/). As far as I understand, Apache Airflow is just a job scheduler that runs tasks. Metaflow from Netflix is a dataflow library that builds machine learning pipelines (with dataflow between steps) in the form of DAGs. Does that basically mean that Metaflow can be executed on Apache Airflow?

Is my understanding correct? If yes, is it possible to convert a Metaflow DAG into an Apache Airflow DAG?

Daniel Yefimov

1 Answer

4

Honestly, I haven't worked with Metaflow, and thank you for introducing it to me! There is a nice introduction video you can find on YouTube.

Airflow is a framework for creating scheduled pipelines. A pipeline is a set of tasks linked to each other, forming a Directed Acyclic Graph (DAG). A pipeline can be scheduled: you can tell it how often or when it should run, and you can tell it from when it should have run in the past and what time period it should backfill. You can run the whole of Airflow as a single Docker container, or as a multi-node cluster, and it has a bunch of existing operators to integrate with third-party services. I recommend looking into Airflow architecture and concepts.
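For illustration, here is a minimal sketch of a scheduled Airflow pipeline using the Airflow 2.x TaskFlow API (the task names and values are placeholders, not anything from your question):

```python
# Minimal sketch of a scheduled Airflow DAG (assumes Airflow 2.x is installed).
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    schedule_interval="@daily",        # how often the pipeline runs
    start_date=datetime(2022, 1, 1),   # from when past runs can be backfilled
    catchup=True,                      # run the missed intervals since start_date
)
def example_pipeline():
    @task
    def extract():
        # pretend to pull some rows from a source system
        return [1, 2, 3]

    @task
    def transform(rows):
        return [r * 2 for r in rows]

    @task
    def load(rows):
        print(f"loading {len(rows)} rows")

    load(transform(extract()))


dag_instance = example_pipeline()
```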

Metaflow looks like something similar, but created specifically for data scientists. I could be wrong here, but looking at the Metaflow Basics it looks like I could create a scheduled pipeline in much the same way as in Airflow, with the difference that each step can hand data directly to the next one.
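Based on those docs, a minimal Metaflow flow might look like the sketch below (assumes the metaflow package is installed; the step names and values are just illustrative). Anything assigned to `self` in one step becomes an artifact that downstream steps can read, which is the dataflow part:

```python
# Minimal sketch of a Metaflow flow; run it with: python example_flow.py run
from metaflow import FlowSpec, step


class ExampleFlow(FlowSpec):

    @step
    def start(self):
        # attributes on self become data artifacts that flow to later steps
        self.numbers = [1, 2, 3]
        self.next(self.double)

    @step
    def double(self):
        self.doubled = [n * 2 for n in self.numbers]
        self.next(self.end)

    @step
    def end(self):
        print(self.doubled)


if __name__ == "__main__":
    ExampleFlow()
```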

I would look into the specific tools you want to integrate with and check which of the two integrates better. As mentioned, Airflow has lots of ready-made connectors and operators, as well as a powerful scheduler with backfill and the Jinja template language for parameterizing your DB queries.
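As a small sketch of that templating (the connection id `my_db` and the table are placeholders, and this assumes the apache-airflow-providers-postgres package is installed):

```python
# Jinja templating in an Airflow SQL task: {{ ds }} is rendered to the
# logical date of each run, so backfilled runs query their own day.
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="templated_query_example",
    schedule_interval="@daily",
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    load_partition = PostgresOperator(
        task_id="load_partition",
        postgres_conn_id="my_db",
        sql="SELECT * FROM events WHERE event_date = '{{ ds }}'",
    )
```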

Hope that is somewhat helpful. Here is also a nice article with a feature comparison.

andnik
  • Thank you for your answer! It looks like the big difference between Airflow and Metaflow is that Metaflow allows dataflow between steps. I am looking for a dataflow engine that can be integrated into Airflow and thought it might be possible to do it with Metaflow. – Daniel Yefimov Jan 04 '22 at 08:24
  • Is that what you are looking for? https://www.astronomer.io/guides/airflow-passing-data-between-tasks – andnik Jan 04 '22 at 09:14
  • Something like that, but it's not an optimal solution, because if the data is bigger than 2 GB it won't work. – Daniel Yefimov Jan 04 '22 at 09:19
  • Yeah, for this I would consider local storage (SQLite or an arbitrary file format) if you run your tasks on one machine, or external storage like S3. – andnik Jan 04 '22 at 09:23
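To illustrate the approach from the last comment, here is a rough sketch of passing large data between Airflow tasks by reference: the data goes to external storage (S3 here), and only the small key travels through XCom. The bucket name, key, and connection ids are assumptions for illustration:

```python
# Pass an S3 key between tasks instead of the data itself
# (assumes apache-airflow-providers-amazon and a configured AWS connection).
from datetime import datetime

from airflow.decorators import dag, task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


@dag(schedule_interval=None, start_date=datetime(2022, 1, 1), catchup=False)
def pass_data_by_reference():
    @task
    def produce() -> str:
        hook = S3Hook(aws_conn_id="aws_default")
        key = "intermediate/large_dataset.csv"
        hook.load_string("col1,col2\n1,2\n", key=key,
                         bucket_name="my-bucket", replace=True)
        return key  # only the small S3 key goes through XCom

    @task
    def consume(key: str):
        hook = S3Hook(aws_conn_id="aws_default")
        data = hook.read_key(key, bucket_name="my-bucket")
        print(f"read {len(data)} bytes from s3://my-bucket/{key}")

    consume(produce())


dag_instance = pass_data_by_reference()
```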