Questions tagged [google-dataflow]
49 questions
3
votes
2 answers
How to count the number of rows in the input file in Google Dataflow file processing?
I am trying to count the number of rows in an input file, and I am using the Cloud Dataflow runner for creating the template. In the code below, I am reading the file from a GCS bucket, processing it, and then storing the output in a Redis instance.
But I…

viveknaskar
- 2,136
- 1
- 20
- 35
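
A minimal sketch of one common approach to the question above, assuming a line-delimited input file (the bucket path is a placeholder and the Redis write is omitted): Beam's built-in Count.Globally combiner yields the total element count.

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.csv")  # placeholder path
     | "CountRows" >> beam.combiners.Count.Globally()              # single total count
     | "Print" >> beam.Map(print))                                 # replace with the Redis write
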
2
votes
2 answers
Nested rows using STRUCT are not supported in Dataflow SQL (GCP)
With Dataflow SQL I would like to read from a Pub/Sub topic, enrich the message, and write it to a Pub/Sub topic.
Which Dataflow SQL query will create my desired output message?
Pub/Sub input message: {"event_timestamp":1619784049000,…

Marko
- 23
- 3
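
If Dataflow SQL rejects the STRUCT construction, one workaround is to do the enrichment in the Beam Python SDK instead, where a nested row is just a nested dict. A hedged sketch, with hypothetical topic names and a hypothetical nested payload shape:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def enrich(msg: bytes) -> bytes:
    event = json.loads(msg.decode("utf-8"))
    # the nested output record is just a nested dict (hypothetical shape)
    out = {"event_timestamp": event["event_timestamp"],
           "payload": {"original": event, "enriched": True}}
    return json.dumps(out).encode("utf-8")

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | beam.io.ReadFromPubSub(topic="projects/my-project/topics/input")   # hypothetical
     | beam.Map(enrich)
     | beam.io.WriteToPubSub(topic="projects/my-project/topics/output"))  # hypothetical
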
2
votes
0 answers
Google Dataflow issue
We are implementing a new data warehouse on Google BigQuery, and all our sources are on-premises databases. So we are using Dataflow for ETL, with Maven and the Apache Beam SDK, in order to run 30 pipelines on the Google Cloud Dataflow service.
package…

Jambal
- 65
- 6
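
The question's code is Java/Maven; purely for illustration, a hedged Python-SDK sketch of submitting one such pipeline to the Dataflow service (project, region, bucket, and the transform itself are placeholders):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

opts = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # placeholder
    region="us-central1",
    temp_location="gs://my-bucket/temp",  # placeholder
    job_name="etl-pipeline-01")

with beam.Pipeline(options=opts) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/source/*.csv")  # placeholder
     | "Transform" >> beam.Map(str.strip)                             # placeholder ETL step
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output/part"))  # placeholder
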
1
vote
2 answers
Can I dynamically alter log levels in Google Dataflow once the job has started?
From https://cloud.google.com/dataflow/docs/guides/logging I am trying to understand whether it is possible to reconfigure the log levels after the job and workers have started.
I suspect the answer is "no" but I wanted to see if anyone knows for sure,…

Eric Kolotyluk
- 1,958
- 2
- 21
- 30
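
The linked docs suggest worker log levels are fixed when the job is submitted. A hedged sketch of the usual compromise in the Python SDK: set the level in DoFn.setup, which still requires redeploying the job to change it (the logger name is hypothetical):

import logging
import apache_beam as beam

class NoisyDoFn(beam.DoFn):
    def setup(self):
        # runs once per worker at startup; changing the level later
        # means updating/replacing the job, not flipping a switch
        logging.getLogger("my.pipeline").setLevel(logging.DEBUG)  # hypothetical logger

    def process(self, element):
        logging.getLogger("my.pipeline").debug("processing %s", element)
        yield element
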
1
vote
1 answer
Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object
I have a simple Apache Beam program which reads an Avro file from GCP Cloud Storage and writes it to BigQuery.
# standard library imports
import logging
import os
import datetime
# Apache Beam imports
import apache_beam as beam
from apache_beam…

user546298
- 41
- 4
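
The "Anonymous caller" error usually means the job found no credentials at all. A hedged sketch, assuming a service-account key that has storage.objects.get on the bucket (the key path, bucket, and table are placeholders):

import os
import apache_beam as beam

# point at a service-account key before the pipeline starts (placeholder path)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/key.json"

with beam.Pipeline() as p:
    (p
     | "ReadAvro" >> beam.io.ReadFromAvro("gs://my-bucket/input.avro")  # placeholder
     | "WriteBQ" >> beam.io.WriteToBigQuery(
           "my-project:my_dataset.my_table",                            # placeholder
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
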
1
vote
1 answer
Sink for user activity data stream to build Online ML model
I am writing a consumer that consumes user activity data (activityid, userid, timestamp, cta, duration) from Google Pub/Sub, and I want to create a sink for it so that I can train my ML model in an online fashion.
Since this sink is the source…

amor.fati95
- 119
- 1
- 11
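
One common choice of sink here is BigQuery, which an online trainer can poll for fresh rows. A hedged sketch, assuming JSON messages carrying the fields listed in the question (the subscription and table names are placeholders):

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(
           subscription="projects/my-project/subscriptions/activity")  # placeholder
     | "Parse" >> beam.Map(lambda m: json.loads(m.decode("utf-8")))
     | "Write" >> beam.io.WriteToBigQuery(
           "my-project:ml.user_activity",                              # placeholder
           schema="activityid:STRING,userid:STRING,timestamp:TIMESTAMP,"
                  "cta:STRING,duration:FLOAT",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
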
1
vote
2 answers
Unacknowledge some Pub/Sub messages in an Apache Beam pipeline
We currently have a use case where we want to process some messages at a later point in time, after certain conditions are met.
Is it possible to unacknowledge some Pub/Sub messages in an Apache Beam pipeline so that they become available again after the visibility time…

Balasubramanian Naagarajan
- 338
- 3
- 17
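
Beam's Pub/Sub source acknowledges messages once they are durably committed into the pipeline, so there is no per-message nack. A hedged sketch of the usual workaround: route "not ready yet" messages to a retry topic via a tagged output (the topic names and readiness check are hypothetical):

import apache_beam as beam
from apache_beam import pvalue
from apache_beam.options.pipeline_options import PipelineOptions

def route(msg: bytes):
    if b'"ready": true' in msg:                # hypothetical readiness check
        yield msg
    else:
        yield pvalue.TaggedOutput("retry", msg)

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    results = (p
        | beam.io.ReadFromPubSub(
              subscription="projects/my-project/subscriptions/in")  # hypothetical
        | beam.FlatMap(route).with_outputs("retry", main="ready"))
    results.ready | "Process" >> beam.Map(lambda m: m)               # real work goes here
    results.retry | "Requeue" >> beam.io.WriteToPubSub(
        topic="projects/my-project/topics/retry")                    # hypothetical
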
1
vote
0 answers
Unable to connect to SSL-enabled Elasticsearch from Google Dataflow
I am unable to connect to an SSL-enabled Elasticsearch cluster using Google Cloud Dataflow. I used the Google-provided template from Git and modified the source code to pass the keystore, keystore password, and truststore to the job along with the Elasticsearch…

Praful Janardhanan
- 13
- 3
1
vote
1 answer
Automatic job to delete BigQuery table records
Is there a way to schedule deletion of rows from a BigQuery table based on a column condition? Something like a job scheduled to run every day.
For example, let's say I have a column called creation_date in the table. I need to delete records when…

Vivek T S
- 65
- 8
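
This does not need Dataflow at all: a BigQuery scheduled query, or a small script triggered by Cloud Scheduler, can run the DELETE daily. A hedged sketch with a hypothetical project, table, and 30-day retention rule:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")   # hypothetical project
query = """
    DELETE FROM `my-project.my_dataset.my_table`
    WHERE creation_date < DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
"""
client.query(query).result()   # blocks until the DML job finishes
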
1
vote
0 answers
Google Cloud Dataflow / Apache Beam: unable to set the BQ query parameter
The requirement is to read the most recently updated records from BigQuery and load them into Cloud SQL.
Here are the steps executed:
Read the BQ table records that are newer than LAST_UPD_TS.
PCollection read_from_bq = pipeline.apply("read from bq",…

Akshay
- 11
- 2
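
The question's code is Java; as a hedged illustration in the Python SDK, BigQuery reads take a plain query string, so the usual approach is to build the query from a pipeline option rather than bind a query parameter (the option name and table are hypothetical):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument("--last_upd_ts", default="2021-01-01 00:00:00")

opts = MyOptions()
query = ("SELECT * FROM `my-project.my_dataset.my_table` "        # hypothetical table
         f"WHERE LAST_UPD_TS > TIMESTAMP('{opts.last_upd_ts}')")

with beam.Pipeline(options=opts) as p:
    rows = p | beam.io.ReadFromBigQuery(query=query, use_standard_sql=True)
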
1
vote
2 answers
How to write BigQuery results to GCS in JSON format using Apache Beam with custom formatting?
I am trying to write BigQuery table records as a JSON file in a GCS bucket using Apache Beam in Python.
I have a BigQuery table, my_project.my_dataset.my_table, like this.
I wish to write the table records/entries into a JSON file in a GCS bucket…

Gopinath S
- 101
- 1
- 14
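
A hedged sketch of one way to do this: ReadFromBigQuery yields each row as a dict, which can be serialized to one JSON line per record (the output path is a placeholder; default=str covers non-JSON types such as dates):

import json
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "Read" >> beam.io.ReadFromBigQuery(
           table="my_project:my_dataset.my_table")
     | "ToJson" >> beam.Map(lambda row: json.dumps(row, default=str))
     | "Write" >> beam.io.WriteToText(
           "gs://my-bucket/output/my_table",     # placeholder path
           file_name_suffix=".json"))
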
1
vote
1 answer
AttributeError: module 'apache_beam' has no attribute 'options'
I am getting the following error when running an Apache Beam pipeline. The full error is:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call…

Ekaba Bisong
- 2,918
- 2
- 23
- 38
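
This error typically means apache_beam.options was never loaded: in the Python SDK the subpackage is not guaranteed to be available via a bare import apache_beam. Importing it explicitly avoids the failed attribute lookup:

import apache_beam as beam
# import the subpackage explicitly rather than relying on beam.options
from apache_beam.options.pipeline_options import PipelineOptions

opts = PipelineOptions(["--runner=DirectRunner"])
with beam.Pipeline(options=opts) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)
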
1
vote
0 answers
CoGroupByKey always fails on big data (Python SDK)
I have about 4,000 input files (~7 MB each on average).
My pipeline always fails at the CoGroupByKey step when the data size reaches about 4 GB.
If I limit the input to only 300 files, it runs just fine.
When it fails, the logs on GCP Dataflow only…

khiem.nix
- 11
- 1
- 1
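
For reference, the standard CoGroupByKey shape on toy data; note that all values for a given key are materialized on a single worker, so one hot key can exhaust memory as the input grows, which may be what the question is hitting:

import apache_beam as beam

with beam.Pipeline() as p:
    emails = p | "E" >> beam.Create([("alice", "alice@example.com")])
    phones = p | "P" >> beam.Create([("alice", "555-0100")])

    # all values per key are gathered onto one worker here
    ({"emails": emails, "phones": phones}
     | beam.CoGroupByKey()
     | beam.Map(print))
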
1
vote
1 answer
Handling rejects in Dataflow/Apache Beam through dependent pipelines
I have a pipeline that gets data from BigQuery and writes it to GCS; however, if I find any rejects I want to write them to a BigQuery table. I am collecting rejects into a global list variable and later loading the list into a BigQuery table. This…

Bob
- 335
- 1
- 4
- 16
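
A global list does not survive distribution across workers; the idiomatic Beam pattern is a tagged side output so the rejects stay inside the pipeline. A hedged sketch with hypothetical tables and a hypothetical validation rule:

import apache_beam as beam
from apache_beam import pvalue

class Validate(beam.DoFn):
    def process(self, row):
        if row.get("id") is not None:                 # hypothetical rule
            yield row
        else:
            yield pvalue.TaggedOutput("rejects", row)

with beam.Pipeline() as p:
    rows = p | beam.io.ReadFromBigQuery(table="my_project:ds.src")  # hypothetical
    result = rows | beam.ParDo(Validate()).with_outputs("rejects", main="good")
    result.good | "Fmt" >> beam.Map(str) | "Out" >> beam.io.WriteToText(
        "gs://my-bucket/out")                                        # hypothetical
    result.rejects | "Rejects" >> beam.io.WriteToBigQuery(
        "my_project:ds.rejects",                                     # hypothetical
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
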
1
vote
1 answer
Dataflow job GCS to Pub/Sub maximum batch size
I'm using the default Dataflow template GCS to Pub/Sub. The input files in Cloud Storage are 300 MB in size, with 2-3 million rows each.
When launching the Dataflow batch job, the following error is raised:
Error message from worker:…

MnR
- 21
- 3
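
A single Pub/Sub publish request is capped at roughly 10 MB, which is the usual cause of this kind of template error. A hedged workaround sketch: a custom pipeline that publishes one line per message and lets the sink handle batching (the file path and topic are placeholders; WriteToPubSub requires a streaming pipeline):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/big-file.csv")  # placeholder
     | "Encode" >> beam.Map(lambda line: line.encode("utf-8"))
     | "Publish" >> beam.io.WriteToPubSub(
           topic="projects/my-project/topics/out"))                   # placeholder
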