Questions tagged [google-cloud-dataflow]

Google Cloud Dataflow is a fully managed cloud service for creating and evaluating data processing pipelines at scale. Dataflow pipelines are based on the Apache Beam programming model and can operate in both batch and streaming modes.

The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam.

Scio is a Scala API for Apache Beam and Google Cloud Dataflow.

Cloud Dataflow is part of the Google Cloud Platform.
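For readers new to the tag, here is a minimal sketch of a Beam pipeline in the Python SDK. The same pipeline code runs in batch mode or, given an unbounded source and streaming=True in its options, in streaming mode; the option values below are illustrative.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Runs locally on the DirectRunner by default; pass
    # runner="DataflowRunner" (plus project, region, temp_location)
    # to execute the same code on Cloud Dataflow.
    options = PipelineOptions()

    with beam.Pipeline(options=options) as p:
        (p
         | beam.Create(["hello", "dataflow"])
         | beam.Map(str.upper)
         | beam.Map(print))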

Some useful questions and answers to look at:

5328 questions
79 votes, 7 answers

What is the difference between Google Cloud Dataflow and Google Cloud Dataproc?

I am using Google Dataflow to implement an ETL data warehouse solution. Looking into the Google Cloud offering, it seems Dataproc can also do the same thing. It also seems Dataproc is a little bit cheaper than Dataflow. Does anybody know the pros /…
36 votes, 3 answers

Apache Beam : FlatMap vs Map?

I want to understand in which scenarios I should use FlatMap or Map. The documentation was not clear to me; I still do not understand in which scenario I should use the FlatMap or Map transformation. Could someone give me an example so I…
asked by Emma Y
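A minimal sketch of the distinction in the Beam Python SDK (the sample input is made up):

    import apache_beam as beam

    with beam.Pipeline() as p:
        lines = p | beam.Create(["hello world", "beam"])

        # Map emits exactly one output element per input element.
        lengths = lines | "Lengths" >> beam.Map(len)        # -> [11, 4]

        # FlatMap emits zero or more output elements per input element.
        words = lines | "Words" >> beam.FlatMap(str.split)  # -> ["hello", "world", "beam"]

Rule of thumb: use Map for strict one-to-one transforms, and FlatMap whenever one input can expand to zero, one, or many outputs.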
32 votes, 4 answers

Google Dataflow vs Apache Spark

I am surveying Google Dataflow and Apache Spark to decide which is the more suitable solution for our big data analysis business needs. I found there are Spark SQL and MLlib on the Spark platform for structured data queries and machine learning. I…
28 votes, 2 answers

Apache Beam: DoFn vs PTransform

Both DoFn and PTransform are means of defining operations on a PCollection. How do we know which to use when?
asked by user_1357
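A hedged sketch in the Beam Python SDK: a DoFn holds per-element logic, while a PTransform composes steps into a reusable unit (the transform names here are invented for illustration):

    import apache_beam as beam

    class SplitWords(beam.DoFn):
        # Per-element processing logic; applied via ParDo.
        def process(self, element):
            yield from element.split()

    class CountWords(beam.PTransform):
        # A composite transform bundling several steps into one reusable unit.
        def expand(self, pcoll):
            return (pcoll
                    | "Split" >> beam.ParDo(SplitWords())
                    | "Count" >> beam.combiners.Count.PerElement())

    with beam.Pipeline() as p:
        counts = p | beam.Create(["a b a"]) | CountWords()  # -> [("a", 2), ("b", 1)]

In short: write a DoFn for the innermost per-element logic, and wrap related steps in a PTransform when you want a named, reusable stage.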
25 votes, 2 answers

Google Dataflow job cost optimization

I have run the code below on 522 gzip files totaling 100 GB; after decompressing, it will be around 320 GB of data in protobuf format, and the output is written to GCS. I have used n1-standard machines, and the region for input and output is all taken care…
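Cost tuning in Dataflow usually starts with the worker options. A hedged sketch of the relevant Python pipeline options; the project, bucket, machine type, and worker cap below are placeholders, not recommendations:

    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder values; the right sizing depends on whether the job is
    # CPU-bound (decompression, protobuf parsing) or I/O-bound (GCS writes).
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
        machine_type="n1-standard-2",  # smaller workers can be cheaper for I/O-bound stages
        max_num_workers=32,            # cap autoscaling to bound worst-case cost
    )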
21 votes, 2 answers

Benefits with Dataflow over cloud functions when moving data?

I'm relatively new to GCP and just starting to set up and evaluate my organization's architecture on GCP. Scenario: data will flow into a Pub/Sub topic (high frequency, low amount of data). The goal is to move that data into Bigtable. From my…
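For scale, the Dataflow side of such a pipeline is small. A hedged sketch of the streaming read in the Beam Python SDK (the topic name is a placeholder, and the final step stands in for the Bigtable write):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.io.ReadFromPubSub(
               topic="projects/my-project/topics/events")  # placeholder topic
         | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
         | "Sink" >> beam.Map(print))  # replace with a Bigtable write transform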
21 votes, 3 answers

Pros/cons of streaming into BigQuery directly vs through Google Pub/Sub + Dataflow

We have a NodeJS API hosted on Google Kubernetes Engine, and we'd like to start logging events into BigQuery. I can see 3 different ways of doing that: insert each event directly into BigQuery using the Node BigQuery SDK in the API (as described…
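For comparison, the Pub/Sub + Dataflow route amounts to a short streaming pipeline. A hedged Beam Python sketch; the subscription, table, and schema are placeholders:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (p
         | beam.io.ReadFromPubSub(
               subscription="projects/my-project/subscriptions/events")  # placeholder
         | beam.Map(lambda b: {"payload": b.decode("utf-8")})  # map to a table row dict
         | beam.io.WriteToBigQuery(
               "my-project:my_dataset.events",  # placeholder table
               schema="payload:STRING",
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

The usual trade-off is operational: direct inserts are simpler, while the Pub/Sub + Dataflow path buys buffering, retries, and room for transformation before rows land.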
20 votes, 4 answers

Using Dataflow vs. Cloud Composer

I'd like to get some clarification on whether Cloud Dataflow or Cloud Composer is the right tool for the job, as it wasn't clear to me from the Google documentation. Currently, I'm using Cloud Dataflow to read a non-standard CSV file -- do some basic…
asked by user10503628
19 votes, 2 answers

How to delete a gcloud Dataflow job?

The Dataflow jobs are cluttered all over my dashboard, and I'd like to delete the failed jobs from my project. But in the dashboard, I don't see any option to delete a Dataflow job. I'm looking for something like the following, at least: $ gcloud beta…
asked by Vijin Paulraj
18 votes, 4 answers

Dataflow setting Controller Service Account

I am trying to set up a controller service account for Dataflow. In my Dataflow options I have: options.setGcpCredential(GoogleCredentials.fromStream( new FileInputStream("key.json")).createScoped(someArrays));…
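The excerpt above is Java; for reference, in the Python SDK the controller service account is typically passed as a pipeline option rather than a credential object. A hedged sketch (project and account names are placeholders):

    from apache_beam.options.pipeline_options import PipelineOptions

    # service_account_email sets the controller service account that the
    # Dataflow workers run as; it must hold the required IAM roles.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",  # placeholder
        service_account_email="dataflow-worker@my-project.iam.gserviceaccount.com",
    )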
18 votes, 4 answers

Easiest way to schedule a Google Cloud Dataflow job

I just need to run a Dataflow pipeline on a daily basis, but suggested solutions like the App Engine Cron Service, which requires building a whole web app, seem a bit too much. I was thinking about just running the pipeline from a…
asked by CCC
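One common pattern is to stage the pipeline as a Dataflow template and have a scheduler (for example Cloud Scheduler) launch it through the Dataflow REST API. A hedged sketch using the google-api-python-client discovery client; every name and path below is a placeholder:

    from googleapiclient.discovery import build

    # Client for the Dataflow REST API (v1b3).
    dataflow = build("dataflow", "v1b3")

    # Launch a job from a template previously staged to GCS.
    request = dataflow.projects().locations().templates().launch(
        projectId="my-project",
        location="us-central1",
        gcsPath="gs://my-bucket/templates/daily-pipeline",
        body={"jobName": "daily-pipeline-run", "parameters": {}},
    )
    response = request.execute()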
17 votes, 1 answer

Apache Beam/Dataflow Reshuffle

What is the purpose of org.apache.beam.sdk.transforms.Reshuffle? In the documentation the purpose is defined as: A PTransform that returns a PCollection equivalent to its input but operationally provides some of the side effects of a GroupByKey,…
asked by user_1357
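In practice, Reshuffle is often used to break fusion after a step that fans a few input elements out into many, so the expanded work can be rebalanced across workers. A minimal sketch in the Beam Python SDK (the fan-out function is invented for illustration):

    import apache_beam as beam

    def expand(n):
        # Hypothetical fan-out: one input element yields many outputs.
        return range(n)

    with beam.Pipeline() as p:
        (p
         | beam.Create([100000])       # a single element...
         | beam.FlatMap(expand)        # ...expands into 100,000 elements
         | beam.Reshuffle()            # breaks fusion so downstream work can
                                       # be redistributed across workers
         | beam.Map(lambda x: x * x))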
17 votes, 2 answers

Writing to Google Cloud Storage from PubSub using Cloud Dataflow using DoFn

I am trying to write Google Pub/Sub messages to Google Cloud Storage using Google Cloud Dataflow. I know that TextIO/AvroIO do not support streaming pipelines. However, I read in [1] that it is possible to write to GCS in a streaming pipeline from a…
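In current Beam Python SDKs, the usual approach is to window the stream and use fileio for windowed file writes. A hedged sketch; the subscription and output path are placeholders, and the exact fileio surface has changed across Beam releases:

    import apache_beam as beam
    from apache_beam.io import fileio
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms import window

    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (p
         | beam.io.ReadFromPubSub(
               subscription="projects/my-project/subscriptions/in")  # placeholder
         | beam.Map(lambda b: b.decode("utf-8"))
         | beam.WindowInto(window.FixedWindows(60))  # one file group per minute
         | fileio.WriteToFiles(
               path="gs://my-bucket/out/",            # placeholder bucket
               sink=lambda dest: fileio.TextSink()))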
16 votes, 1 answer

What is the watermark heuristic for PubsubIO running on GCD?

Hi, I'm trying to run a pipeline where I am calculating diffs between messages that are published to Pub/Sub with 30-second heartbeats (10K streams, each heartbeating every 30 seconds). I don't care about 100% data completeness, but I'd like to understand…
asked by Keith Berkoben
14 votes, 2 answers

Google dataflow streaming pipeline is not distributing workload over several workers after windowing

I'm trying to set up a Dataflow streaming pipeline in Python. I have quite some experience with batch pipelines. Our basic architecture looks like this: the first step does some basic processing and takes about 2 seconds per message to get to…
asked by Brecht Coghe