Scio is a Scala API for Google Cloud Dataflow and Apache Beam inspired by Spark and Scalding.
Questions tagged [spotify-scio]
80 questions
6
votes
1 answer
How does dataflow trigger AfterProcessingTime.pastFirstElementInPane() work?
In the Dataflow streaming world, my understanding when I write:
Window.into(FixedWindows.of(Duration.standardHours(1)))
.triggering(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardMinutes(15)))
is that for a fixed…

Anil Muppalla
- 416
- 4
- 14
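For a fixed window, this trigger fires once, 15 minutes of processing time after the first element lands in the window's pane; without Repeatedly.forever it does not fire again for later elements. A minimal Scio sketch of the same windowing, assuming a hypothetical events: SCollection[Event]:

import com.spotify.scio.values.WindowOptions
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime
import org.apache.beam.sdk.values.WindowingStrategy.AccumulationMode
import org.joda.time.Duration

// Fire once per window, 15 min (processing time) after the first element arrives.
val windowed = events.withFixedWindows(
  Duration.standardHours(1),
  options = WindowOptions(
    trigger = AfterProcessingTime.pastFirstElementInPane()
      .plusDelayOf(Duration.standardMinutes(15)),
    accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES,
    allowedLateness = Duration.ZERO
  )
)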
4
votes
1 answer
Beam pipeline does not produce any output after GroupByKey with windowing, and I get a memory error
Purpose:
I want to load streaming data, then add a key, and then count elements by key.
Problem:
The Apache Beam Dataflow pipeline gets a memory error when I try to load and group-by-key a large dataset using the streaming approach (unbounded data). Because it…

Saeed Mohtasham
- 1,693
- 16
- 27
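A GroupByKey on an unbounded collection buffers state until a window or trigger bounds the pane, which is a common source of out-of-memory errors. A minimal Scio sketch that windows the stream before grouping, assuming a hypothetical topic and key extractor:

import org.joda.time.Duration

// Window the unbounded stream first so per-key grouping state stays bounded.
val counts = sc
  .pubsubTopic[String]("projects/my-project/topics/events") // hypothetical topic
  .withFixedWindows(Duration.standardMinutes(5))
  .keyBy(extractKey)                                        // hypothetical key extractor
  .countByKey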
4
votes
1 answer
Inconsistent behavior in the functioning of Dataflow templates?
When I create a Dataflow template, the characteristics of runtime parameters are not persisted in the template file.
At runtime, if I try to pass a value for such a parameter, I get a 400 error.
I'm using Scio 0.3.2 and Scala 2.11.11 with Apache Beam…

Damien GOUYETTE
- 467
- 1
- 3
- 11
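Classic Dataflow templates only record parameters that are declared as ValueProvider in the pipeline options; anything read as a plain value is baked in when the template is created, and passing it at launch time then fails. A minimal sketch of the Beam options pattern (names hypothetical); note that runtime parameters also require IO transforms that accept ValueProvider, which a version as old as Scio 0.3.2 largely predates:

import org.apache.beam.sdk.options.{Description, PipelineOptions, ValueProvider}

// A runtime parameter: its value may be supplied when the template is launched.
trait TemplateOptions extends PipelineOptions {
  @Description("Input path, resolvable at run time")
  def getInputPath: ValueProvider[String]
  def setInputPath(value: ValueProvider[String]): Unit
}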
4
votes
1 answer
Google Pub/Sub to Dataflow, avoid duplicates with Record ID
I'm trying to build a streaming Dataflow job which reads events from Pub/Sub and writes them into BigQuery.
According to the documentation, Dataflow can detect duplicate message delivery if a record ID is used (see:…

Vincent Spiewak
- 231
- 1
- 3
- 7
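In Scio this maps to the idAttribute parameter of the Pub/Sub read: Dataflow then deduplicates messages whose attribute values match, within a bounded interval (roughly ten minutes, per the documentation). A minimal sketch with a hypothetical attribute and subscription name:

// Messages carrying the same "uniqueId" attribute value are treated as duplicates.
val events = sc.pubsubSubscription[String](
  "projects/my-project/subscriptions/events-sub", // hypothetical subscription
  idAttribute = "uniqueId"
)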
4
votes
1 answer
error writing PubSub stream to Cloud Storage using Dataflow
Using Scio from Spotify to write a Dataflow job, following two examples (e.g1 and e.g2) to write a Pub/Sub stream to GCS, but I get the following error for the code below.
Error
Exception in thread "main" java.lang.IllegalArgumentException: Write can…

DAR
- 106
- 1
- 10
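The usual cause is applying a bounded-style Write to an unbounded collection; file writes on a stream need windowing, windowed writes, and an explicit shard count. A minimal Scio sketch using Beam's TextIO through a custom output, assuming a hypothetical msgs: SCollection[String] and bucket:

import org.apache.beam.sdk.io.TextIO
import org.joda.time.Duration

// Window the stream, then write each window's pane to GCS.
msgs
  .withFixedWindows(Duration.standardMinutes(10))
  .saveAsCustomOutput(
    "WriteWindowedTextToGcs",
    TextIO.write()
      .to("gs://my-bucket/output/part") // hypothetical bucket/prefix
      .withWindowedWrites()
      .withNumShards(10)
  )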
3
votes
1 answer
How to limit PCollection in Apache Beam as soon as possible?
I'm using Apache Beam 2.28.0 on Google Cloud Dataflow (with the Scio SDK). I have a large input PCollection (bounded) and I want to limit / sample it to a fixed number of elements, but I want to start the downstream processing as soon as…

Marcin Zablocki
- 10,171
- 1
- 37
- 47
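Beam's built-in Sample.any keeps up to n arbitrary elements, though it is implemented as a combine, so downstream stages still wait for it to settle rather than starting immediately; it is the straightforward baseline against the question's "as soon as possible" requirement. A minimal Scio sketch, assuming a hypothetical input: SCollection[String]:

import org.apache.beam.sdk.transforms.Sample

// Keep at most 1000 arbitrary elements of the bounded input.
val limited = input.applyTransform(Sample.any[String](1000L))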
3
votes
0 answers
sbt gets stuck and goes OutOfMemory
sbt gets stuck when trying to compile this project (snowplow/snowplow, branch sbt_issue) and goes OutOfMemory:
$ sbt compile
[info] Loading settings for project global-plugins from metals.sbt ...
[info] Loading global plugins from…

benjben
- 181
- 7
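When sbt compiles a large multi-module build like snowplow/snowplow, the default JVM heap and metaspace are often too small; raising them in .sbtopts (or SBT_OPTS) is the usual first step. A minimal sketch, with sizes that are guesses to adjust for the machine:

# .sbtopts at the project root
-J-Xmx4G
-J-Xss4M
-J-XX:MaxMetaspaceSize=1G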
3
votes
2 answers
Join batch data with data stored in BigTable
I have growing data in GCS and will have a batch job that runs, let's say, every day to process an increment of 1 million articles. I need to get additional information for the keys from BigTable (containing billions of records). Is it feasible to do just…

Adam Horky
- 113
- 6
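A side input of billions of Bigtable rows is not feasible, but per-element point lookups from a client opened in @Setup are a common pattern (Scio also ships async lookup DoFn base classes for exactly this). A minimal sketch with the Cloud Bigtable client; project, instance, table, and output shape are all hypothetical:

import com.google.cloud.bigtable.data.v2.BigtableDataClient
import org.apache.beam.sdk.transforms.{DoFn, ParDo}
import org.apache.beam.sdk.transforms.DoFn.{ProcessElement, Setup, Teardown}

class EnrichFromBigtable extends DoFn[String, (String, String)] {
  @transient private var client: BigtableDataClient = _

  @Setup def setup(): Unit =
    client = BigtableDataClient.create("my-project", "my-instance")

  @ProcessElement
  def process(c: DoFn[String, (String, String)]#ProcessContext): Unit = {
    val row = client.readRow("articles", c.element()) // point lookup by row key
    if (row != null) c.output((c.element(), row.toString))
  }

  @Teardown def teardown(): Unit =
    if (client != null) client.close()
}

val enriched = articleKeys.applyTransform(ParDo.of(new EnrichFromBigtable))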
3
votes
1 answer
Maintaining a global state within Apache Beam
We have a PubSub topic with events sinking into BigQuery (though the particular DB is almost irrelevant here). Events can come with new unknown properties that eventually should end up as separate BigQuery columns.
So, basically I have two questions…

chuwy
- 6,310
- 4
- 20
- 29
3
votes
2 answers
Streaming data from CloudSql into Dataflow
We are currently exploring how we can process a large amount of data stored in a Google Cloud SQL database (MySQL) using Apache Beam/Google Dataflow.
The database stores about 200GB of data in a single table.
We successfully read rows from the database…

Scarysize
- 4,131
- 25
- 37
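Beam reads from Cloud SQL through JdbcIO, which Scio wraps in the scio-jdbc module. A minimal sketch; the connection details are hypothetical and the exact option names vary a little across Scio versions:

import com.spotify.scio.jdbc._

// Read rows from a MySQL table over JDBC.
val connOpts = JdbcConnectionOptions(
  username = "user",
  password = Some("secret"),
  connectionUrl = "jdbc:mysql://127.0.0.1:3306/mydb", // e.g. via the Cloud SQL proxy
  classOf[com.mysql.jdbc.Driver]
)

val rows = sc.jdbcSelect(JdbcReadOptions(
  connectionOptions = connOpts,
  query = "SELECT id, payload FROM big_table",
  rowMapper = rs => (rs.getLong("id"), rs.getString("payload"))
))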
3
votes
2 answers
How to match multiple files with names using TextIO.Read in Cloud Dataflow
I have a GCS folder as below:
gs://<bucket>/<path>/dt=2017-12-01/part-0000.tsv
gs://<bucket>/<path>/dt=2017-12-02/part-0000.tsv
gs://<bucket>/<path>/dt=2017-12-03/part-0000.tsv
…

pishen
- 319
- 3
- 16
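TextIO (and therefore Scio's textFile) accepts glob patterns, so the date partitions can be matched in a single read. A minimal sketch with a hypothetical bucket and prefix:

// One read matching every dt= partition's part files.
val rows = sc.textFile("gs://my-bucket/logs/dt=2017-12-*/part-*.tsv")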
3
votes
0 answers
Dataflow / Apache Beam: trigger window on number of bytes in window
I have a simple job that moves data from Pub/Sub to GCS. The Pub/Sub topic is a shared topic with many different message types of varying size.
I want the result in GCS to be vertically partitioned accordingly:
Schema/version/year/month/day/
under that…

Luke De Feo
- 2,025
- 3
- 22
- 40
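Beam has no built-in trigger keyed on pane byte size; the usual approximation is an element-count trigger sized from the average message, combined with a processing-time flush. A minimal Scio sketch (thresholds hypothetical, msgs a hypothetical stream):

import com.spotify.scio.values.WindowOptions
import org.apache.beam.sdk.transforms.windowing.{AfterFirst, AfterPane, AfterProcessingTime, Repeatedly}
import org.apache.beam.sdk.values.WindowingStrategy.AccumulationMode
import org.joda.time.Duration

val windowed = msgs.withFixedWindows(
  Duration.standardMinutes(10),
  options = WindowOptions(
    trigger = Repeatedly.forever(
      AfterFirst.of(
        AfterPane.elementCountAtLeast(50000), // fire early once the pane is "big"
        AfterProcessingTime.pastFirstElementInPane()
          .plusDelayOf(Duration.standardMinutes(5))
      )
    ),
    accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES,
    allowedLateness = Duration.ZERO
  )
)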
2
votes
1 answer
How to convert SCollection[String] to Seq[String] or List[String]?
I want to convert SCollection[String] to Seq[String] or List[String].
For example, I have a variable called ids.
val ids: SCollection[String] = ~
ids.saveAsTextFile(pathToGCS)
When I save it to Cloud Storage, the contents of the text file are a…

user39613
- 21
- 2
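An SCollection is a deferred computation, so it can only become a local Seq after the pipeline has run; materialize plus the ScioResult tap is the usual route. A minimal sketch against the newer Scio API (older versions return Futures instead):

// Mark the collection for materialization, run the pipeline, then read it back.
val tap = ids.materialize
val result = sc.run().waitUntilDone()
val idSeq: Seq[String] = result.tap(tap).value.toSeq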
2
votes
1 answer
Beam SQL Not Firing
I am building a simple prototype wherein I am reading data from Pub/Sub and using BeamSQL; code snippet below:
val eventStream: SCollection[String] = sc.pubsubSubscription[String]("projects/jayadeep-etl-platform/subscriptions/orders-dataflow")
…

Jayadeep Jayaraman
- 2,747
- 3
- 15
- 26
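On a stream, a BeamSQL aggregation has nothing to fire on until the input is windowed (or given a trigger), which is the usual reason nothing comes out. A minimal sketch that windows the subscription from the question before the SQL step:

import com.spotify.scio.values.SCollection
import org.joda.time.Duration

val eventStream: SCollection[String] =
  sc.pubsubSubscription[String]("projects/jayadeep-etl-platform/subscriptions/orders-dataflow")

// Window first; then convert to Rows with a schema and apply
// SqlTransform.query("SELECT ... FROM PCOLLECTION GROUP BY ...").
val windowed = eventStream.withFixedWindows(Duration.standardMinutes(1))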
2
votes
2 answers
Custom timestamp and windowing for Pub/Sub in DataFlow (Apache Beam)
I want to implement the following scenario using streaming pipeline in Apache Beam (and running it on Google DataFlow):
Read messages from Pub/Sub (JSON strings)
Deserialize JSONs
Use a custom field (say timeStamp) as the timestamp value for the…

Marcin Zablocki
- 10,171
- 1
- 37
- 47
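When publishers can set a Pub/Sub attribute, the timestampAttribute parameter of the read is the cleanest way to get event-time windows; otherwise the elements can be re-timestamped after parsing. A minimal Scio sketch, assuming a hypothetical parseJson deserializer and an epoch-millis timeStamp field (note that timestampBy allows no backward skew by default):

import org.joda.time.{Duration, Instant}

val windowed = sc
  .pubsubSubscription[String]("projects/my-project/subscriptions/events-sub") // hypothetical
  .map(parseJson)                              // hypothetical JSON deserializer
  .timestampBy(e => new Instant(e.timeStamp))  // move elements onto event time
  .withFixedWindows(Duration.standardMinutes(5))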