Questions tagged [apache-beam]

Apache Beam is a unified SDK for batch and stream processing. It lets you specify large-scale data processing workflows with a Beam-specific DSL. Beam workflows can be executed on different runtimes such as Apache Flink, Apache Spark, or Google Cloud Dataflow (a cloud service).

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

The programming model behind Beam evolved at Google and was originally known as the “Dataflow Model”. Beam pipelines can be executed on different runtimes like Apache Flink, Apache Spark, or Google Cloud Dataflow.

4676 questions
115
votes
3 answers

What are the benefits of Apache Beam over Spark/Flink for batch processing?

Apache Beam supports multiple runner backends, including Apache Spark and Flink. I'm familiar with Spark/Flink and I'm trying to see the pros/cons of Beam for batch processing. Looking at the Beam word count example, it feels it is very similar to…
bluenote10
  • 23,414
  • 14
  • 122
  • 178
85
votes
4 answers

Apache Airflow or Apache Beam for data processing and job scheduling

I'm trying to give useful information but I am far from being a data engineer. I am currently using the python library pandas to execute a long series of transformation to my data which has a lot of inputs (currently CSV and excel files). The…
LouisB
  • 973
  • 1
  • 7
  • 9
62
votes
1 answer

Explain Apache Beam Python syntax

I have read through the Beam documentation and also looked through the Python documentation, but haven't found a good explanation of the syntax used in most of the example Apache Beam code. Can anyone explain what the _, |, and >> are doing in…
dobbysock1002
  • 907
  • 10
  • 15
56
votes
2 answers

What is Apache Beam?

I was going through the Apache posts and found a new term called Beam. Can anybody explain what exactly Apache Beam is? I tried to Google it but was unable to get a clear answer.
Viswa
  • 1,357
  • 3
  • 18
  • 30
36
votes
3 answers

Apache Beam : FlatMap vs Map?

I want to understand in which scenarios I should use FlatMap rather than Map. The documentation did not seem clear to me. Could someone give me an example so I…
Emma Y
  • 555
  • 1
  • 9
  • 16
28
votes
2 answers

Apache Beam: DoFn vs PTransform

Both DoFn and PTransform are means of defining operations on a PCollection. How do we know which to use when?
user_1357
  • 7,766
  • 13
  • 63
  • 106
26
votes
1 answer

What is the difference between DoFn.Setup and DoFn.StartBundle?

What is the difference between these two annotations? DoFn.Setup Annotation for the method to use to prepare an instance for processing bundles of elements. Uses the word "bundle", takes zero arguments. DoFn.StartBundle Annotation for the method to…
Jacob Marble
  • 28,555
  • 22
  • 67
  • 78
25
votes
2 answers

google dataflow job cost optimization

I have run the below code on 522 gzip files of size 100 GB; after decompressing, it is around 320 GB of data in protobuf format, and the output is written to GCS. I have used n1-standard machines, and the region for input and output was all taken care…
21
votes
1 answer

How do you express denormalization joins in Apache Beam that stretch over long periods of time

For context, I've never used Beam. I'm trying to understand how to apply the Beam model to common use cases. Consider that you have an unbounded collection of Producers and an unbounded collection of Products such that each Product has a Producer (one to…
Steven Noble
  • 10,204
  • 13
  • 45
  • 57
20
votes
4 answers

Using Dataflow vs. Cloud Composer

I'd like to get some clarification on whether Cloud Dataflow or Cloud Composer is the right tool for the job, and I wasn't clear from the Google Documentation. Currently, I'm using Cloud Dataflow to read a non-standard csv file -- do some basic…
user10503628
19
votes
1 answer

Difference between Apache Beam and Apache NiFi

What are the use cases for Apache Beam and Apache NiFi? It seems both of them are data flow engines. In case both have a similar use case, which of the two is better?
sanjay
  • 354
  • 2
  • 6
  • 27
17
votes
1 answer

Apache Beam/Dataflow Reshuffle

What is the purpose of org.apache.beam.sdk.transforms.Reshuffle? In the documentation the purpose is defined as: A PTransform that returns a PCollection equivalent to its input but operationally provides some of the side effects of a GroupByKey,…
user_1357
  • 7,766
  • 13
  • 63
  • 106
17
votes
5 answers

Collecting output from Apache Beam pipeline and displaying it to console

I have been working with Apache Beam for a couple of days. I wanted to quickly iterate on the application I am working on and make sure the pipeline I am building is error free. In Spark we can use sc.parallelize and when we apply some action we get the…
Shamshad Alam
  • 1,684
  • 3
  • 19
  • 31
16
votes
1 answer

Difference between beam.ParDo and beam.Map in the output type?

I am using Apache Beam to run some data transformations, which include data extraction from txt, csv, and other data sources. One thing I noticed is the difference in results when using beam.Map and beam.ParDo. In the next sample I am…
Soliman
  • 1,132
  • 3
  • 12
  • 32
16
votes
1 answer

Update singleton HashMap using Google pub/sub

I have a use case where I initialise a HashMap that contains a set of lookup data (information about the physical location etc. of IoT devices). This lookup data serves as reference data for a 2nd dataset which is a PCollection. This PCollection is…
Chris Halcrow
  • 28,994
  • 18
  • 176
  • 206