I am in the early stages of learning Airflow. I am learning it to build a simple ETL (ELT?) data pipeline, and I am in the process of figuring out the architecture for the pipeline (which operators I should use). The basics of my data pipeline are going to be:

  1. Make HTTP GET request from API for raw data.
  2. Save raw JSON results into a GCP bucket.
  3. Transform the data and save into a BigQuery database.

...and the pipeline will be scheduled to run once daily.

As the title suggests, I am trying to determine whether the SimpleHttpOperator or the PythonOperator is more appropriate for making the HTTP GET requests for data. In a somewhat related Stack Overflow post, the author simply concluded:

Though I think I'm going to simply use the PythonOperator from now on

It seems simple enough to write a 10-20 line Python script that makes the HTTP request, identifies the GCP storage bucket, and writes to that bucket. However, I'm not sure if this is the best approach for this type of task (call API --> get data --> write to GCP storage bucket).
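
For concreteness, here is a rough sketch of what such a script might look like as a callable that a PythonOperator could run. The function names, the `raw/` object prefix, and the date-based file naming are my own assumptions for illustration; the upload helper assumes the `google-cloud-storage` package is installed and credentials are configured.

```python
import json
import urllib.request
from datetime import date


def build_object_name(run_date: date, prefix: str = "raw") -> str:
    """Return a date-partitioned object path, e.g. raw/2024-01-31.json (assumed layout)."""
    return f"{prefix}/{run_date.isoformat()}.json"


def http_get_json(url: str, timeout: float = 30.0):
    """GET the URL and decode the JSON body using only the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def upload_to_gcs(bucket_name: str, object_name: str, payload: str) -> None:
    """Write a string to a GCS object; requires the google-cloud-storage package."""
    # Imported lazily so the rest of the module loads without the dependency.
    from google.cloud import storage

    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    blob.upload_from_string(payload, content_type="application/json")


def fetch_and_store(api_url, bucket_name, run_date,
                    get=http_get_json, upload=upload_to_gcs) -> str:
    """Call the API, serialize the result, and store it under a dated object name."""
    data = get(api_url)
    object_name = build_object_name(run_date)
    upload(bucket_name, object_name, json.dumps(data))
    return object_name
```

The `get` and `upload` parameters are there only so the logic can be exercised with fakes, without touching the network or GCP.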

Any help or thoughts on this, links to examples of similar pipelines, etc. would be greatly appreciated. Thanks in advance.

Mikhail Berlyant
Canovice

1 Answer

I recommend treating Airflow as the glue between processing steps. The logic that lives in Airflow should be limited to orchestration: conditionally triggering or skipping a step, looping over steps, and handling errors.

Why? Because if you decide to switch to a different workflow tool tomorrow, you won't have to rewrite your processing code; you will only have to rewrite the workflow logic. It is a simple separation of concerns.

Therefore, I recommend deploying your 10-20 lines of Python code as a Cloud Function and using a SimpleHttpOperator to call it. In addition, it's far easier to test a function than a workflow (both to run it and to inspect the code). Deployments and updates will also be easier.
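
As a minimal sketch of the function side, an HTTP-triggered Cloud Function in Python receives a Flask-style request object and can return a `(body, status)` tuple. Everything named here is hypothetical: the JSON field names (`api_url`, `bucket`, `run_date`), the `run_extract` placeholder standing in for your 10-20 lines, and the `gs://.../raw/` path layout.

```python
import json
from datetime import date


def run_extract(api_url: str, bucket_name: str, run_date: date) -> str:
    """Placeholder for the real extract step (GET the API, write JSON to GCS)."""
    # Hypothetical: a real version would make the request and upload the payload.
    return f"gs://{bucket_name}/raw/{run_date.isoformat()}.json"


def extract_http(request, extract=run_extract):
    """HTTP entry point: validate the JSON body, run the extract, report where it landed."""
    body = request.get_json(silent=True) or {}
    missing = [key for key in ("api_url", "bucket", "run_date") if key not in body]
    if missing:
        return json.dumps({"error": f"missing fields: {missing}"}), 400
    try:
        run_date = date.fromisoformat(body["run_date"])
    except ValueError:
        return json.dumps({"error": "run_date must be YYYY-MM-DD"}), 400
    location = extract(body["api_url"], body["bucket"], run_date)
    return json.dumps({"stored_at": location}), 200
```

On the Airflow side, the SimpleHttpOperator would then POST a JSON body to this function's URL. Since the operator's `data` field is templated, the run date could be passed as `{{ ds }}`. Returning a 4xx status on bad input matters because, with default settings, a non-2xx response should cause the Airflow task to fail rather than silently succeed.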

guillaume blaquiere