Questions tagged [google-dataflow]
49 questions
3
votes
2 answers
How to count the number of rows in the input file in Google Dataflow file processing?
I am trying to count the number of rows in an input file, and I am using the Cloud Dataflow runner for creating the template. In the code below, I am reading the file from a GCS bucket, processing it, and then storing the output in a Redis instance.
But I…

viveknaskar
- 2,136
- 1
- 20
- 35
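
A minimal sketch of one common approach to the question above, assuming a line-delimited input file (the bucket path is a placeholder and the Redis write is omitted): Beam's built-in Count.Globally combiner yields the total element count.

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.csv")  # placeholder path
     | "CountRows" >> beam.combiners.Count.Globally()              # single total count
     | "Print" >> beam.Map(print))                                 # replace with the Redis write
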
2
votes
2 answers
Nested rows using STRUCT are not supported in Dataflow SQL (GCP)
With Dataflow SQL I would like to read from a Pub/Sub topic, enrich the message, and write it to a Pub/Sub topic.
Which Dataflow SQL query will create my desired output message?
Pub/Sub input message: {"event_timestamp":1619784049000,…

Marko
- 23
- 3
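
If Dataflow SQL rejects the STRUCT construction, one workaround is to do the enrichment in the Beam Python SDK instead, where a nested row is just a nested dict. A hedged sketch, with hypothetical topic names and a hypothetical nested payload shape:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def enrich(msg: bytes) -> bytes:
    event = json.loads(msg.decode("utf-8"))
    # the nested output record is just a nested dict (hypothetical shape)
    out = {"event_timestamp": event["event_timestamp"],
           "payload": {"original": event, "enriched": True}}
    return json.dumps(out).encode("utf-8")

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | beam.io.ReadFromPubSub(topic="projects/my-project/topics/input")   # hypothetical
     | beam.Map(enrich)
     | beam.io.WriteToPubSub(topic="projects/my-project/topics/output"))  # hypothetical
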
2
votes
0 answers
Google Dataflow issue
We are implementing a new data warehouse on Google BigQuery, and all our sources are on-premises databases. So we are using Dataflow for ETL, with Maven and the Apache Beam SDK, in order to run 30 pipelines on the Google Cloud Dataflow service.
package…

Jambal
- 65
- 6
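
The question's code is Java/Maven; purely for illustration, a hedged Python-SDK sketch of submitting one such pipeline to the Dataflow service (project, region, bucket, and the transform itself are placeholders):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

opts = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # placeholder
    region="us-central1",
    temp_location="gs://my-bucket/temp",  # placeholder
    job_name="etl-pipeline-01")

with beam.Pipeline(options=opts) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/source/*.csv")  # placeholder
     | "Transform" >> beam.Map(str.strip)                             # placeholder ETL step
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output/part"))  # placeholder
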
1
vote
2 answers
Can I dynamically alter log levels in Google Dataflow once the job has started?
From https://cloud.google.com/dataflow/docs/guides/logging I am trying to understand whether it is possible to reconfigure the log levels after the job and workers have started.
I suspect the answer is "no" but I wanted to see if anyone knows for sure,…

Eric Kolotyluk
- 1,958
- 2
- 21
- 30
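
The linked docs suggest worker log levels are fixed when the job is submitted. A hedged sketch of the usual compromise in the Python SDK: set the level in DoFn.setup, which still requires redeploying the job to change it (the logger name is hypothetical):

import logging
import apache_beam as beam

class NoisyDoFn(beam.DoFn):
    def setup(self):
        # runs once per worker at startup; changing the level later
        # means updating/replacing the job, not flipping a switch
        logging.getLogger("my.pipeline").setLevel(logging.DEBUG)  # hypothetical logger

    def process(self, element):
        logging.getLogger("my.pipeline").debug("processing %s", element)
        yield element
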
1
vote
1 answer
Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object
I have a simple Apache Beam program which reads an Avro file from GCP Cloud Storage and writes it to BigQuery.
# standard library imports
import logging
import os
import datetime
# Apache Beam imports
import apache_beam as beam
from apache_beam…

user546298
- 41
- 4
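
The "Anonymous caller" error usually means the job found no credentials at all. A hedged sketch, assuming a service-account key that has storage.objects.get on the bucket (the key path, bucket, and table are placeholders):

import os
import apache_beam as beam

# point at a service-account key before the pipeline starts (placeholder path)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/key.json"

with beam.Pipeline() as p:
    (p
     | "ReadAvro" >> beam.io.ReadFromAvro("gs://my-bucket/input.avro")  # placeholder
     | "WriteBQ" >> beam.io.WriteToBigQuery(
           "my-project:my_dataset.my_table",                            # placeholder
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
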
1
vote
1 answer
Sink for user activity data stream to build Online ML model
I am writing a consumer that consumes user activity data (activityid, userid, timestamp, cta, duration) from Google Pub/Sub, and I want to create a sink for it so that I can train my ML model in an online fashion.
Since this sink is the source…

amor.fati95
- 119
- 1
- 11
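
One common choice of sink here is BigQuery, which an online trainer can poll for fresh rows. A hedged sketch, assuming JSON messages carrying the fields listed in the question (the subscription and table names are placeholders):

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(
           subscription="projects/my-project/subscriptions/activity")  # placeholder
     | "Parse" >> beam.Map(lambda m: json.loads(m.decode("utf-8")))
     | "Write" >> beam.io.WriteToBigQuery(
           "my-project:ml.user_activity",                              # placeholder
           schema="activityid:STRING,userid:STRING,timestamp:TIMESTAMP,"
                  "cta:STRING,duration:FLOAT",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
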
1
vote
2 answers
Unacknowledge some Pub/Sub messages in an Apache Beam pipeline
We currently have a use case where we want to process some messages at a later point in time, after certain conditions are met.
Is it possible to unacknowledge some Pub/Sub messages in an Apache Beam pipeline so that they become available again after the visibility time…

Balasubramanian Naagarajan
- 338
- 3
- 17
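
Beam's Pub/Sub source acknowledges messages once they are durably committed into the pipeline, so there is no per-message nack. A hedged sketch of the usual workaround: route "not ready yet" messages to a retry topic via a tagged output (the topic names and readiness check are hypothetical):

import apache_beam as beam
from apache_beam import pvalue
from apache_beam.options.pipeline_options import PipelineOptions

def route(msg: bytes):
    if b'"ready": true' in msg:                # hypothetical readiness check
        yield msg
    else:
        yield pvalue.TaggedOutput("retry", msg)

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    results = (p
        | beam.io.ReadFromPubSub(
              subscription="projects/my-project/subscriptions/in")  # hypothetical
        | beam.FlatMap(route).with_outputs("retry", main="ready"))
    results.ready | "Process" >> beam.Map(lambda m: m)               # real work goes here
    results.retry | "Requeue" >> beam.io.WriteToPubSub(
        topic="projects/my-project/topics/retry")                    # hypothetical
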
1
vote
0 answers
Unable to connect to SSL-enabled Elasticsearch from Google Dataflow
I am unable to connect to an SSL-enabled Elasticsearch cluster using Google Cloud Dataflow. I used the Google-provided template from Git and modified the source code to pass the keystore, keystore password, and truststore to the job along with the Elasticsearch…

Praful Janardhanan
- 13
- 3
1
vote
1 answer
Automatic job to delete BigQuery table records
Is there a way to schedule deletion of rows from a BigQuery table based on a column condition? Something like a job scheduled to run every day.
For example, let's say I have a column called creation_date in the table. I need to delete records when…

Vivek T S
- 65
- 8
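
This does not need Dataflow at all: a BigQuery scheduled query, or a small script triggered by Cloud Scheduler, can run the DELETE daily. A hedged sketch with a hypothetical project, table, and 30-day retention rule:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")   # hypothetical project
query = """
    DELETE FROM `my-project.my_dataset.my_table`
    WHERE creation_date < DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
"""
client.query(query).result()   # blocks until the DML job finishes
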
1
vote
0 answers
Google Cloud Dataflow / Apache Beam: unable to set the BQ query parameter
The requirement is to read the most recently updated records from BigQuery and load them into Cloud SQL.
Here are the steps executed:
Read the BQ table records that are newer than LAST_UPD_TS.
PCollection read_from_bq = pipeline.apply("read from bq",…

Akshay
- 11
- 2
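
The question's code is Java; as a hedged illustration in the Python SDK, BigQuery reads take a plain query string, so the usual approach is to build the query from a pipeline option rather than bind a query parameter (the option name and table are hypothetical):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument("--last_upd_ts", default="2021-01-01 00:00:00")

opts = MyOptions()
query = ("SELECT * FROM `my-project.my_dataset.my_table` "        # hypothetical table
         f"WHERE LAST_UPD_TS > TIMESTAMP('{opts.last_upd_ts}')")

with beam.Pipeline(options=opts) as p:
    rows = p | beam.io.ReadFromBigQuery(query=query, use_standard_sql=True)
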
1
vote
2 answers
How to write BigQuery results to GCS in JSON format using Apache Beam with custom formatting?
I am trying to write BigQuery table records as a JSON file in a GCS bucket using Apache Beam in Python.
I have a BigQuery table, my_project.my_dataset.my_table, like this.
I wish to write the table records/entries into a JSON file in a GCS bucket…

Gopinath S
- 101
- 1
- 14
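
A hedged sketch of one way to do this: ReadFromBigQuery yields each row as a dict, which can be serialized to one JSON line per record (the output path is a placeholder; default=str covers non-JSON types such as dates):

import json
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "Read" >> beam.io.ReadFromBigQuery(
           table="my_project:my_dataset.my_table")
     | "ToJson" >> beam.Map(lambda row: json.dumps(row, default=str))
     | "Write" >> beam.io.WriteToText(
           "gs://my-bucket/output/my_table",     # placeholder path
           file_name_suffix=".json"))
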
1
vote
1 answer
AttributeError: module 'apache_beam' has no attribute 'options'
I am getting the following error when running an Apache Beam pipeline. The full error is:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call…

Ekaba Bisong
- 2,918
- 2
- 23
- 38
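
This error typically means apache_beam.options was never loaded: in the Python SDK the subpackage is not guaranteed to be available via a bare import apache_beam. Importing it explicitly avoids the failed attribute lookup:

import apache_beam as beam
# import the subpackage explicitly rather than relying on beam.options
from apache_beam.options.pipeline_options import PipelineOptions

opts = PipelineOptions(["--runner=DirectRunner"])
with beam.Pipeline(options=opts) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)
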
1
vote
0 answers
CoGroupByKey always fails on big data (Python SDK)
I have about 4,000 input files (~7 MB each on average).
My pipeline always fails at the CoGroupByKey step when the data size reaches about 4 GB.
If I limit the input to only 300 files, it runs just fine.
When it fails, the logs on GCP Dataflow only…

khiem.nix
- 11
- 1
- 1
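
For reference, the standard CoGroupByKey shape on toy data; note that all values for a given key are materialized on a single worker, so one hot key can exhaust memory as the input grows, which may be what the question is hitting:

import apache_beam as beam

with beam.Pipeline() as p:
    emails = p | "E" >> beam.Create([("alice", "alice@example.com")])
    phones = p | "P" >> beam.Create([("alice", "555-0100")])

    # all values per key are gathered onto one worker here
    ({"emails": emails, "phones": phones}
     | beam.CoGroupByKey()
     | beam.Map(print))
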
1
vote
1 answer
Handling rejects in Dataflow/Apache Beam through dependent pipelines
I have a pipeline that gets data from BigQuery and writes it to GCS; however, if I find any rejects I want to write them to a BigQuery table. I am collecting rejects into a global list variable and later loading the list into a BigQuery table. This…

Bob
- 335
- 1
- 4
- 16
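
A global list does not survive distribution across workers; the idiomatic Beam pattern is a tagged side output so the rejects stay inside the pipeline. A hedged sketch with hypothetical tables and a hypothetical validation rule:

import apache_beam as beam
from apache_beam import pvalue

class Validate(beam.DoFn):
    def process(self, row):
        if row.get("id") is not None:                 # hypothetical rule
            yield row
        else:
            yield pvalue.TaggedOutput("rejects", row)

with beam.Pipeline() as p:
    rows = p | beam.io.ReadFromBigQuery(table="my_project:ds.src")  # hypothetical
    result = rows | beam.ParDo(Validate()).with_outputs("rejects", main="good")
    result.good | "Fmt" >> beam.Map(str) | "Out" >> beam.io.WriteToText(
        "gs://my-bucket/out")                                        # hypothetical
    result.rejects | "Rejects" >> beam.io.WriteToBigQuery(
        "my_project:ds.rejects",                                     # hypothetical
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
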
1
vote
1 answer
Dataflow job GCS to Pub/Sub maximum batch size
I'm using the default Dataflow template GCS to Pub/Sub. The input files in Cloud Storage are 300 MB in size, with 2-3 million rows each.
When launching the Dataflow batch job, the following error is raised:
Error message from worker:…

MnR
- 21
- 3
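
A single Pub/Sub publish request is capped at roughly 10 MB, which is the usual cause of this kind of template error. A hedged workaround sketch: a custom pipeline that publishes one line per message and lets the sink handle batching (the file path and topic are placeholders; WriteToPubSub requires a streaming pipeline):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/big-file.csv")  # placeholder
     | "Encode" >> beam.Map(lambda line: line.encode("utf-8"))
     | "Publish" >> beam.io.WriteToPubSub(
           topic="projects/my-project/topics/out"))                   # placeholder
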