Questions tagged [apache-beam]

Apache Beam is a unified SDK for batch and stream processing. It lets you specify large-scale data processing workflows with a Beam-specific DSL. Beam workflows can be executed on different runtimes such as Apache Flink, Apache Spark, or Google Cloud Dataflow (a cloud service).

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

The programming model behind Beam evolved at Google and was originally known as the “Dataflow Model”. Beam pipelines can be executed on different runtimes like Apache Flink, Apache Spark, or Google Cloud Dataflow.

4676 questions
115
votes
3 answers

What are the benefits of Apache Beam over Spark/Flink for batch processing?

Apache Beam supports multiple runner backends, including Apache Spark and Flink. I'm familiar with Spark/Flink and I'm trying to see the pros/cons of Beam for batch processing. Looking at the Beam word count example, it feels it is very similar to…
bluenote10
  • 23,414
  • 14
  • 122
  • 178
85
votes
4 answers

Apache Airflow or Apache Beam for data processing and job scheduling

I'm trying to give useful information but I am far from being a data engineer. I am currently using the python library pandas to execute a long series of transformation to my data which has a lot of inputs (currently CSV and excel files). The…
LouisB
  • 973
  • 1
  • 7
  • 9
62
votes
1 answer

Explain Apache Beam Python syntax

I have read through the Beam documentation and also looked through the Python documentation, but haven't found a good explanation of the syntax used in most of the example Apache Beam code. Can anyone explain what the _, |, and >> are doing in…
dobbysock1002
  • 907
  • 10
  • 15
56
votes
2 answers

What is Apache Beam?

I was going through the Apache posts and found a new term called Beam. Can anybody explain what exactly Apache Beam is? I tried to Google it but was unable to get a clear answer.
Viswa
  • 1,357
  • 3
  • 18
  • 30
36
votes
3 answers

Apache Beam : FlatMap vs Map?

I want to understand in which scenarios I should use FlatMap rather than Map. The documentation did not seem clear to me. Could someone give me an example so I…
Emma Y
  • 555
  • 1
  • 9
  • 16
28
votes
2 answers

Apache Beam: DoFn vs PTransform

Both DoFn and PTransform are means of defining operations on a PCollection. How do we know which to use when?
user_1357
  • 7,766
  • 13
  • 63
  • 106
26
votes
1 answer

What is the difference between DoFn.Setup and DoFn.StartBundle?

What is the difference between these two annotations? DoFn.Setup Annotation for the method to use to prepare an instance for processing bundles of elements. Uses the word "bundle", takes zero arguments. DoFn.StartBundle Annotation for the method to…
Jacob Marble
  • 28,555
  • 22
  • 67
  • 78
25
votes
2 answers

google dataflow job cost optimization

I have run the below code on 522 gzip files of size 100 GB; after decompressing, it is around 320 GB of data in protobuf format, and the output is written to GCS. I have used n1-standard machines, and the region for input and output was all taken care…
21
votes
1 answer

How do you express denormalization joins in Apache Beam that stretch over long periods of time

For context, I've never used Beam. I'm trying to understand how to apply the Beam model to common use cases. Consider that you have an unbounded collection of Producers and an unbounded collection of Products such that each Product has a Producer (one to…
Steven Noble
  • 10,204
  • 13
  • 45
  • 57
20
votes
4 answers

Using Dataflow vs. Cloud Composer

I'd like to get some clarification on whether Cloud Dataflow or Cloud Composer is the right tool for the job, and I wasn't clear from the Google Documentation. Currently, I'm using Cloud Dataflow to read a non-standard csv file -- do some basic…
user10503628
19
votes
1 answer

Difference between Apache Beam and Apache NiFi

What are the use cases for Apache Beam and Apache NiFi? It seems both of them are data flow engines. In case both have a similar use case, which of the two is better?
sanjay
  • 354
  • 2
  • 6
  • 27
17
votes
1 answer

Apache Beam/Dataflow Reshuffle

What is the purpose of org.apache.beam.sdk.transforms.Reshuffle? In the documentation the purpose is defined as: A PTransform that returns a PCollection equivalent to its input but operationally provides some of the side effects of a GroupByKey,…
user_1357
  • 7,766
  • 13
  • 63
  • 106
17
votes
5 answers

Collecting output from Apache Beam pipeline and displaying it to console

I have been working with Apache Beam for a couple of days. I wanted to quickly iterate on the application I am working on and make sure the pipeline I am building is error free. In Spark we can use sc.parallelize and when we apply some action we get the…
Shamshad Alam
  • 1,684
  • 3
  • 19
  • 31
16
votes
1 answer

Difference between beam.ParDo and beam.Map in the output type?

I am using Apache Beam to run some data transformations, which include data extraction from txt, csv, and other data sources. One thing I noticed is the difference in results when using beam.Map and beam.ParDo. In the next sample I am…
Soliman
  • 1,132
  • 3
  • 12
  • 32
16
votes
1 answer

Update singleton HashMap using Google pub/sub

I have a use case where I initialise a HashMap that contains a set of lookup data (information about the physical location etc. of IoT devices). This lookup data serves as reference data for a 2nd dataset which is a PCollection. This PCollection is…
Chris Halcrow
  • 28,994
  • 18
  • 176
  • 206