I am in the early stages of learning Airflow. I am learning it to build a simple ETL (ELT?) data pipeline, and am in the process of figuring out the architecture for the pipeline (which operators I should use). The basics of my data pipeline are going to be:
- Make an HTTP GET request to an API for raw data.
- Save raw JSON results into a GCP bucket.
- Transform the data and save into a BigQuery database.
...and the pipeline will be scheduled to run once daily.
As the title suggests, I am trying to determine whether the SimpleHttpOperator or the PythonOperator is more appropriate for making the HTTP GET requests for data. In this somewhat related Stack Overflow post, the author simply concluded:
Though I think I'm going to simply use the PythonOperator from now on
It seems simple enough to write a 10-20 line Python script that makes the HTTP request, identifies the GCP storage bucket, and writes to that bucket. However, I'm not sure if this is the best approach for this type of task (call API --> get data --> write to GCP storage bucket).
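For concreteness, here is a rough sketch of the kind of script I have in mind. The function names, the `raw/<date>.json` object layout, and the URL and bucket arguments are all placeholders I made up, and I'm assuming the `google-cloud-storage` client library is installed on the workers:

```python
import json
import urllib.request


def fetch_json(url, timeout=30):
    """GET the URL and return the decoded JSON payload."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def upload_to_gcs(bucket_name, blob_name, payload):
    """Write a JSON string to gs://<bucket_name>/<blob_name>."""
    # Imported lazily so this module loads even without the GCP client installed.
    from google.cloud import storage

    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_string(payload, content_type="application/json")


def extract_to_gcs(url, bucket_name, run_date, fetch=fetch_json, upload=upload_to_gcs):
    """Fetch raw JSON and store it under a date-stamped object name.

    `fetch` and `upload` are injectable so the function can be tested
    without network access or GCP credentials.
    """
    data = fetch(url)
    blob_name = f"raw/{run_date}.json"  # hypothetical naming scheme
    upload(bucket_name, blob_name, json.dumps(data))
    return blob_name
```

I would then hand `extract_to_gcs` to a PythonOperator via `python_callable`, passing the URL, bucket name, and `{{ ds }}` as the run date through `op_kwargs`.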
Any help or thoughts on this, or links to examples of similar pipelines, would be greatly appreciated. Thanks in advance.