
I am a newbie in the Apache Beam environment, trying to fit an Apache Beam pipeline to batch orchestration.

My definition of a batch is as follows:

Batch ==> a set of jobs,
Job ==> can have one or more sub-jobs.

There can be dependencies between jobs/sub-jobs.

Can an Apache Beam pipeline be mapped to my custom batch?

Rahul Ranjan
  • It would definitely help if you could provide more details about the specifics of the jobs. Are all the sub-jobs part of one Apache Beam pipeline, are they separate Apache Beam pipelines that need to run one after another, or are some of the sub-jobs not related to Apache Beam at all, e.g. deleting objects in a bucket? – Saransh Apr 30 '22 at 12:36
  • @SaranshVashisth: sub-jobs will be part of the same pipeline. A few of them might need to run one after another, and others can run in parallel. The pipeline-to-batch mapping should be one to one. – Rahul Ranjan May 01 '22 at 01:44
  • Yeah, as the answer suggests, you should take a look at Cloud Composer. We're also using it in the project I'm working on, to ingest data, preprocess it and then send it down to a Pub/Sub topic for API consumption in separate Dataflow and BigQuery procedure jobs. – Saransh May 02 '22 at 02:43
  • If my answer addressed your question, please consider accepting it. If not, let me know so that I can improve my answer. – Shipra Sarkar May 04 '22 at 05:31

2 Answers


Apache Beam is a unified model for developing both batch and streaming pipelines, which can be run on Dataflow. You can create and deploy your pipeline using Dataflow. Beam pipelines are also portable, so you can use any of the available runners according to your requirements.
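
If the sub-jobs are part of a single pipeline (as you mention in the comments), Beam itself already expresses their dependencies: chained transforms run sequentially, and independent branches off the same PCollection are free to run in parallel. Below is a minimal sketch, assuming the Python SDK and hypothetical sub-job names, not code from your actual batch:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    # The "job" reads the input once.
    source = p | "ReadInput" >> beam.Create(["record-1", "record-2"])

    # Two independent "sub-jobs" branch off the same PCollection,
    # so the runner may execute them in parallel.
    upper = source | "SubJobA" >> beam.Map(lambda r: r.upper())
    sizes = source | "SubJobB" >> beam.Map(len)

    # A dependent "sub-job" consumes SubJobA's output, so it only
    # processes elements SubJobA has already produced.
    upper | "SubJobC" >> beam.Map(print)
```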

Cloud Composer can be used for the batch orchestration you describe. Cloud Composer is built on Apache Airflow. Apache Beam and Apache Airflow can be used together, since Airflow can trigger the Beam jobs. Since you have custom jobs running, you can configure Beam and Airflow together for batch orchestration.

Airflow is meant to perform orchestration and pipeline dependency management, while Beam is used to build data pipelines that are executed by data processing systems.
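
As a rough sketch of what the orchestration layer could look like, assuming Airflow 2.x with the Apache Beam provider installed (the DAG id, file paths and runner choice below are placeholders, not something from your setup), each job in the batch becomes a task, and the dependency arrows between tasks encode your job-level dependencies:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.beam.operators.beam import (
    BeamRunPythonPipelineOperator,
)

with DAG(
    dag_id="custom_batch",            # placeholder DAG name
    start_date=datetime(2022, 5, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Each "job" in the batch is a task that launches a Beam pipeline.
    # Dataflow-specific options are omitted for brevity.
    job_1 = BeamRunPythonPipelineOperator(
        task_id="job_1",
        py_file="gs://my-bucket/pipelines/job_1.py",   # placeholder path
        runner="DataflowRunner",
    )
    job_2 = BeamRunPythonPipelineOperator(
        task_id="job_2",
        py_file="gs://my-bucket/pipelines/job_2.py",   # placeholder path
        runner="DataflowRunner",
    )
    job_3 = BeamRunPythonPipelineOperator(
        task_id="job_3",
        py_file="gs://my-bucket/pipelines/job_3.py",   # placeholder path
        runner="DataflowRunner",
    )

    # job_1 and job_2 have no dependency on each other and can run in
    # parallel; job_3 only starts once both have succeeded.
    [job_1, job_2] >> job_3
```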

Shipra Sarkar

I believe Composer might be better suited for what you're trying to build. From there, you can launch Dataflow jobs from your environment using Airflow operators (for example, if you're using Python, you can use the DataflowCreatePythonJobOperator).
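
A small, hedged example of what that operator call could look like, assuming the Google provider package is installed; the bucket path, job name and region are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowCreatePythonJobOperator,
)

with DAG(
    dag_id="launch_beam_on_dataflow",   # placeholder DAG name
    start_date=datetime(2022, 5, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Launches the Beam pipeline stored in GCS as a Dataflow job.
    launch_pipeline = DataflowCreatePythonJobOperator(
        task_id="launch_pipeline",
        py_file="gs://my-bucket/pipelines/my_pipeline.py",  # placeholder
        job_name="my-pipeline",                             # placeholder
        location="us-central1",                             # placeholder
        options={"temp_location": "gs://my-bucket/temp"},   # placeholder
    )
```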

Mr. Nobody