
In the docs, they say that you should avoid passing data between tasks:

This is a subtle but very important point: in general, if two operators need to share information, like a filename or small amount of data, you should consider combining them into a single operator. If it absolutely can’t be avoided, Airflow does have a feature for operator cross-communication called XCom that is described in the section XComs.

I fundamentally don't understand what they mean. If there's no data to pass between two tasks, why are they part of the same DAG?

I've got half a dozen different tasks that take turns editing one file in place, and each sends an XML report to a final task that compiles a report of what was done. Airflow wants me to put all of that in one Operator? Then what am I gaining by doing it in Airflow? Or how can I restructure it in an Airflowy way?

rescdsk

3 Answers


To avoid having everything in one operator, you have to save the data somewhere. I don't quite understand your flow, but if, for instance, you want to extract data from an API and insert it into a database, you would need:

  1. A PythonOperator (or BashOperator, whatever) that takes the data from the API and saves it to S3/a local file/Google Drive/Azure Storage...
  2. A SQL-related operator that takes the data from that storage and inserts it into the database (see the sketch after this list).
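
To make that concrete, here is a minimal sketch of the pattern (assuming Airflow 2-style imports; `fetch_from_api`, `insert_into_db`, and the `/tmp/api_dump.json` path are placeholders you would replace with your own code):

```python
from datetime import datetime
import json

from airflow import DAG
from airflow.operators.python import PythonOperator

STAGING_PATH = "/tmp/api_dump.json"  # hypothetical staging location


def extract_to_file():
    # Step 1: pull from the API and stage the result in a file,
    # instead of handing the data itself to the next task.
    data = fetch_from_api()  # placeholder for your API client code
    with open(STAGING_PATH, "w") as f:
        json.dump(data, f)


def load_to_db():
    # Step 2: read the staged file and insert it into the database.
    # In practice this could be a SQL/transfer operator instead.
    with open(STAGING_PATH) as f:
        data = json.load(f)
    insert_into_db(data)  # placeholder for your DB client code


with DAG("api_to_db", start_date=datetime(2020, 11, 1), schedule_interval=None) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_to_file)
    load = PythonOperator(task_id="load", python_callable=load_to_db)
    extract >> load
```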

Anyway, if you know which files you are going to edit, you can also use Jinja templates or read the info from a text file and build a loop or something in the DAG, as sketched below. I could help you more if you clarify your actual flow a little bit.
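
A rough sketch of that loop idea (assuming Airflow 2-style imports; the file list and `edit_in_place.sh` script are made up): the filenames are known at DAG-definition time, so each task gets its file baked in and nothing has to be passed around at runtime.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical list of files; this could also be read from a text file
# when the DAG is parsed instead of being hard-coded.
FILES = ["a.xml", "b.xml", "c.xml"]

with DAG("edit_known_files", start_date=datetime(2020, 11, 1), schedule_interval=None) as dag:
    for name in FILES:
        BashOperator(
            task_id=f"edit_{name.replace('.', '_')}",
            # The filename is baked into the task at definition time.
            bash_command=f"./edit_in_place.sh {name}",  # placeholder script
        )
```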

Javier Lopez Tomas

Fundamentally, each instance of an operator in a DAG is mapped to a different task.

This is a subtle but very important point: in general if two operators need to share information, like a filename or small amount of data, you should consider combining them into a single operator

The sentence above means that if any information needs to be shared between two different tasks, it is best to combine them into one task instead of using two different tasks. On the other hand, if you must use two different tasks and you need to pass some information from one task to another, you can do it using Airflow's XCom, which is similar to a key-value store.

In a data engineering use case, checking the file schema before processing is important. Imagine two tasks as follows:

  1. Files_Exist_Check: the purpose of this task is to check whether particular files exist in a directory before continuing.
  2. Check_Files_Schema: the purpose of this task is to check whether the file schema matches the expected schema.

It would only make sense to start your processing if the Files_Exist_Check task succeeds, i.e. you have some files to process. In this case, in the Files_Exist_Check task you can "push" a key like "file_exists" to XCom, with the value being the count of files present in that particular directory. You then "pull" this value using the same key in the Check_Files_Schema task; if it returns 0, there are no files for you to process, so you can raise an exception and fail the task, or handle it gracefully.
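
A minimal sketch of that push/pull (assuming Airflow 2, where the task context is passed to the callable automatically; the `/data/incoming` directory is made up):

```python
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

DATA_DIR = "/data/incoming"  # hypothetical directory being checked


def files_exist_check(**context):
    # "push" the file count to XCom under an explicit key
    count = len(os.listdir(DATA_DIR))
    context["ti"].xcom_push(key="file_exists", value=count)


def check_files_schema(**context):
    # "pull" the value pushed by the upstream task using the same key
    count = context["ti"].xcom_pull(task_ids="Files_Exist_Check", key="file_exists")
    if not count:
        raise ValueError("No files to process")
    # ... schema validation would go here ...


with DAG("file_checks", start_date=datetime(2020, 11, 1), schedule_interval=None) as dag:
    exist = PythonOperator(task_id="Files_Exist_Check", python_callable=files_exist_check)
    schema = PythonOperator(task_id="Check_Files_Schema", python_callable=check_files_schema)
    exist >> schema
```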

Hence, sharing information across tasks using XCom does come in handy in this case.

You can refer to the following links for more info:

  1. https://www.astronomer.io/guides/airflow-datastores/
  2. Airflow - How to pass xcom variable into Python function
Anand Vidvat
  • Ah, in the first paragraph, are you saying that it's bad to share data between *operators* but it's OK to share data between *tasks* (operator instances)? – rescdsk Nov 24 '20 at 19:43
  • Speaking in an object-oriented manner, operators are actually just classes designed for specific functionality, and when I say an instance of an operator, I really mean an instance as in an object of that class (operator). So technically, if you want to pass some value within the same class, you can use attributes/properties of that class, but remember that no two objects of the same class will share values/properties unless they are static properties. – Anand Vidvat Nov 24 '20 at 19:48

I've decided that, as mentioned by @Anand Vidvat, they are making a distinction between Operators and Tasks here. What I think is that they don't want you to write two Operators that inherently need to be paired together and pass data to each other. On the other hand, it's fine to have one task use data from another; you just have to provide filenames etc. in the DAG definition.
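
As a sketch of what that can look like (assuming Airflow 2-style imports; the path and callables are hypothetical), both tasks get the same filename as a constructor parameter, so nothing needs to be handed off at runtime:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

SHARED_FILE = "/data/report.xml"  # hypothetical file both tasks agree on


def edit_file(path):
    ...  # edit the file in place (placeholder)


def compile_report(path):
    ...  # read the edited file and build the final report (placeholder)


with DAG("shared_filename", start_date=datetime(2020, 11, 1), schedule_interval=None) as dag:
    edit = PythonOperator(
        task_id="edit",
        python_callable=edit_file,
        op_kwargs={"path": SHARED_FILE},  # filename provided in the DAG definition
    )
    report = PythonOperator(
        task_id="report",
        python_callable=compile_report,
        op_kwargs={"path": SHARED_FILE},  # same filename, no runtime hand-off
    )
    edit >> report
```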

For example, many of the built-in Operators have constructor parameters for files, like the S3FileTransformOperator. Confusing documentation, but oh well!

rescdsk