Questions tagged [spotify-scio]

Scio is a Scala API for Google Cloud Dataflow and Apache Beam, inspired by Spark and Scalding.

80 questions
6
votes
1 answer

How does dataflow trigger AfterProcessingTime.pastFirstElementInPane() work?

In the Dataflow streaming world, my understanding when I say: Window.into(FixedWindows.of(Duration.standardHours(1))) .triggering(AfterProcessingTime.pastFirstElementInPane() .plusDelayOf(Duration.standardMinutes(15)) is that for a fixed…
Anil Muppalla
  • 416
  • 4
  • 14
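
For reference, the trigger from the excerpt can be written in Scio via `WindowOptions`; a minimal sketch, assuming a hypothetical unbounded `events` stream:

```scala
import com.spotify.scio.values.{SCollection, WindowOptions}
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime
import org.apache.beam.sdk.values.WindowingStrategy.AccumulationMode
import org.joda.time.Duration

// `events` is a hypothetical unbounded stream
def windowed(events: SCollection[String]): SCollection[String] =
  events.withFixedWindows(
    Duration.standardHours(1),
    options = WindowOptions(
      trigger = AfterProcessingTime
        .pastFirstElementInPane()
        .plusDelayOf(Duration.standardMinutes(15)),
      accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES,
      allowedLateness = Duration.ZERO
    )
  )
```

Each hourly window then emits a pane 15 minutes of processing time after its first element arrives.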
4
votes
1 answer

Beam pipeline does not produce any output after GroupByKey with windowing, and I get a memory error

purpose: I want to load stream data, then add a key and then count them by key. problem: The Apache Beam Dataflow pipeline gets a memory error when I try to load and group-by-key large data using the streaming approach (unbounded data). Because it…
Saeed Mohtasham
  • 1,693
  • 16
  • 27
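
The usual remedy is to window the stream before grouping and to prefer combiner-based aggregations over a raw GroupByKey; a minimal sketch, assuming a Pub/Sub source and a made-up `extractKey`:

```scala
import com.spotify.scio.ContextAndArgs
import org.joda.time.Duration

object CountByKeyExample {
  // hypothetical key function: first comma-separated field
  def extractKey(line: String): String = line.takeWhile(_ != ',')

  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    sc.pubsubSubscription[String](args("subscription"))
      // window before grouping: an unwindowed GroupByKey on an
      // unbounded stream buffers forever and can exhaust memory
      .withFixedWindows(Duration.standardMinutes(5))
      .map(extractKey)
      .countByValue // combiner-based, far cheaper than groupBy + size
      .debug()      // placeholder; a real job writes to a streaming-friendly sink

    sc.run()
  }
}
```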
4
votes
1 answer

Inconsistent behavior in the functioning of Dataflow templates?

When I create a Dataflow template, the characteristics of runtime parameters are not persisted in the template file. At runtime, if I try to pass a value for such a parameter, I get a 400 error. I'm using Scio 0.3.2, Scala 2.11.11 with Apache Beam…
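
Classic Dataflow templates only persist options declared as `ValueProvider`s; everything else is frozen at template-creation time, which matches the symptom described. A hedged Beam-style sketch in Scala (interface and parameter names are illustrative):

```scala
import org.apache.beam.sdk.options.{Description, PipelineOptions, ValueProvider}

// Only ValueProvider-typed options stay deferred; anything read
// eagerly at pipeline-construction time gets baked into the template.
trait MyTemplateOptions extends PipelineOptions {
  @Description("Input path, resolved when the template is launched")
  def getInputPath: ValueProvider[String]
  def setInputPath(value: ValueProvider[String]): Unit
}
```

Older Scio releases exposed only part of Beam's ValueProvider machinery, which may be what surfaces here as the 400 at launch.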
4
votes
1 answer

Google Pub/Sub to Dataflow, avoid duplicates with Record ID

I'm trying to build a streaming Dataflow job which reads events from Pub/Sub and writes them into BigQuery. According to the documentation, Dataflow can detect duplicate message delivery if a Record ID is used (see:…
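
In Scio the Record ID maps to the `idAttribute` parameter of the Pub/Sub read; a sketch with placeholder subscription and attribute names:

```scala
import com.spotify.scio.ContextAndArgs

object DedupExample {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, _) = ContextAndArgs(cmdlineArgs)

    // Dataflow treats messages carrying the same value in this
    // attribute as duplicates (within its deduplication window)
    sc.pubsubSubscription[String](
        "projects/my-project/subscriptions/my-sub", // placeholder
        idAttribute = "recordId"                    // placeholder attribute
      )
      .debug() // placeholder sink

    sc.run()
  }
}
```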
4
votes
1 answer

Error writing Pub/Sub stream to Cloud Storage using Dataflow

Using Scio from Spotify to write a job for Dataflow, following two examples (e.g1 and e.g2) to write a Pub/Sub stream to GCS, but I get the following error for the code below: Exception in thread "main" java.lang.IllegalArgumentException: Write can…
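
That IllegalArgumentException usually means a file-based Write was applied to an unbounded collection. One hedged workaround is to window the stream and drop to Beam's TextIO with windowed writes (path and durations are illustrative):

```scala
import com.spotify.scio.values.SCollection
import org.apache.beam.sdk.io.TextIO
import org.joda.time.Duration

def writeStream(lines: SCollection[String]): Unit =
  lines
    .withFixedWindows(Duration.standardMinutes(10))
    .saveAsCustomOutput(
      "WriteWindowedText",
      TextIO.write()
        .to("gs://my-bucket/output/part") // hypothetical prefix
        .withWindowedWrites()             // required for unbounded input
        .withNumShards(1)                 // windowed writes need fixed sharding
    )
```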
3
votes
1 answer

How to limit PCollection in Apache Beam as soon as possible?

I'm using Apache Beam 2.28.0 on Google Cloud Dataflow (with the Scio SDK). I have a large bounded input PCollection and I want to limit / sample it to a fixed number of elements, but I want to start the downstream processing as soon as…
Marcin Zablocki
  • 10,171
  • 1
  • 37
  • 47
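
Beam's exact `Sample.any(n)` sits behind a grouping step, so nothing flows downstream until the sample is assembled. If an approximate cap is acceptable, a per-element filter keeps the pipeline fully streaming; a sketch (the 1% rate is an assumption):

```scala
import com.spotify.scio.values.SCollection
import scala.util.Random

// Approximate alternative to an exact limit: each element passes
// independently, so downstream stages start immediately instead of
// waiting behind the GroupByKey inside Sample.any(n).
def approxLimit[T](in: SCollection[T], fraction: Double = 0.01): SCollection[T] =
  in.filter(_ => Random.nextDouble() < fraction)
```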
3
votes
0 answers

sbt gets stuck and goes OutOfMemory

sbt gets stuck when trying to compile this project (snowplow/snowplow, branch sbt_issue) and goes OutOfMemory: $ sbt compile [info] Loading settings for project global-plugins from metals.sbt ... [info] Loading global plugins from…
benjben
  • 181
  • 7
3
votes
2 answers

Join batch data with data stored in BigTable

I have growing data in GCS and will have a batch job that runs, let's say, every day to process an increment of 1 million articles. I need to get additional information for the keys from BigTable (containing billions of records). Is it feasible to do just…
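
With billions of Bigtable rows a side input will not fit in memory, so a shuffle join of the daily increment against the table scan is one option; a sketch with made-up project/instance/table IDs and key functions:

```scala
import com.google.bigtable.v2.Row
import com.spotify.scio.ScioContext
import com.spotify.scio.bigtable._
import com.spotify.scio.values.SCollection

def enrich(sc: ScioContext,
           articles: SCollection[(String, String)]): SCollection[(String, (String, Row))] = {
  // Full-table scan keyed by row key; if the daily key set is much
  // smaller than the table, point lookups from a DoFn may be cheaper
  // than scanning billions of rows.
  val btRows: SCollection[(String, Row)] =
    sc.bigtable("my-project", "my-instance", "my-table") // hypothetical IDs
      .keyBy(_.getKey.toStringUtf8)

  articles.join(btRows)
}
```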
3
votes
1 answer

Maintaining a global state within Apache Beam

We have a PubSub topic with events sinking into BigQuery (though the particular DB is almost irrelevant here). Events can come with new, unknown properties that should eventually end up as separate BigQuery columns. So, basically, I have two questions…
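
The first half of this, spotting not-yet-known properties on the stream, can be phrased as a pipeline step; a sketch assuming events are already parsed into property maps and a static `knownColumns` set:

```scala
import com.spotify.scio.values.SCollection
import org.joda.time.Duration

// events are assumed to be parsed already into property maps
def newProperties(events: SCollection[Map[String, String]]): SCollection[String] = {
  val knownColumns = Set("id", "timestamp", "userId") // assumed current schema

  events
    .withFixedWindows(Duration.standardMinutes(1)) // distinct needs a window on streams
    .flatMap(_.keys)
    .distinct
    .filter(k => !knownColumns.contains(k))
  // each emitted name would then drive a BigQuery schema-update call
}
```

The hard part, keeping `knownColumns` current across workers, is exactly the global-state question; Beam's state API or a periodically refreshed side input are the usual candidates.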
3
votes
2 answers

Streaming data from CloudSql into Dataflow

We are currently exploring how we can process a large amount of data stored in a Google Cloud SQL database (MySQL) using Apache Beam/Google Dataflow. The database stores about 200GB of data in a single table. We successfully read rows from the database…
Scarysize
  • 4,131
  • 25
  • 37
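
Cloud SQL reads typically go through Beam's JdbcIO, which Scio can attach with `customInput`; a sketch with placeholder connection details and query:

```scala
import java.sql.ResultSet

import com.spotify.scio.ScioContext
import com.spotify.scio.values.SCollection
import org.apache.beam.sdk.coders.StringUtf8Coder
import org.apache.beam.sdk.io.jdbc.JdbcIO

def readUsers(sc: ScioContext): SCollection[String] =
  sc.customInput(
    "ReadFromCloudSql",
    JdbcIO
      .read[String]()
      .withDataSourceConfiguration(
        JdbcIO.DataSourceConfiguration
          .create("com.mysql.jdbc.Driver", "jdbc:mysql://10.0.0.1/db") // placeholders
          .withUsername("user")
          .withPassword("secret"))
      .withQuery("SELECT name FROM users") // placeholder query
      .withRowMapper(new JdbcIO.RowMapper[String] {
        override def mapRow(rs: ResultSet): String = rs.getString(1)
      })
      .withCoder(StringUtf8Coder.of())
  )
```

A single JDBC query is consumed by one worker, so a 200GB table is usually split into key ranges (or, in newer Beam versions, read with JdbcIO.readWithPartitions) to get parallelism.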
3
votes
2 answers

How to match multiple files with names using TextIO.Read in Cloud Dataflow

I have a GCS folder as below:
gs:////dt=2017-12-01/part-0000.tsv
/dt=2017-12-02/part-0000.tsv
/dt=2017-12-03/part-0000.tsv
…
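
TextIO, and therefore Scio's `textFile`, accepts glob patterns, so all daily partitions can be matched in a single read; a sketch with a placeholder bucket following the question's layout:

```scala
import com.spotify.scio.ContextAndArgs

object GlobReadExample {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, _) = ContextAndArgs(cmdlineArgs)

    // one glob read across all daily partitions (bucket/path assumed)
    val rows = sc.textFile("gs://my-bucket/root/dt=2017-12-*/part-*.tsv")
    rows.count.debug() // placeholder use

    sc.run()
  }
}
```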
3
votes
0 answers

Dataflow / Apache Beam: trigger window on number of bytes in window

I have a simple job that moves data from Pub/Sub to GCS. The Pub/Sub topic is a shared topic with many different message types of varying size. I want the result in GCS vertically partitioned accordingly: Schema/version/year/month/day/; under that…
Luke De Feo
  • 2,025
  • 3
  • 22
  • 40
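
Beam has no byte-count trigger; the nearest built-in is an element-count trigger, which approximates a byte budget when message sizes are roughly known. A sketch with illustrative counts and durations:

```scala
import com.spotify.scio.values.{SCollection, WindowOptions}
import org.apache.beam.sdk.transforms.windowing.{AfterFirst, AfterPane, AfterProcessingTime, Repeatedly}
import org.apache.beam.sdk.values.WindowingStrategy.AccumulationMode
import org.joda.time.Duration

// Fire when roughly N elements have buffered, or after a time bound,
// whichever comes first; with a known average message size, N stands
// in for a byte budget.
def sizeBoundedPanes(msgs: SCollection[String]): SCollection[String] =
  msgs.withGlobalWindow(
    WindowOptions(
      trigger = Repeatedly.forever(
        AfterFirst.of(
          AfterPane.elementCountAtLeast(10000), // ~N messages per pane
          AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(5)))),
      accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES,
      allowedLateness = Duration.ZERO
    )
  )
```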
2
votes
1 answer

How to convert SCollection[String] to Seq[String] or List[String]?

I want to convert SCollection[String] to Seq[String] or List[String]. For example, I have a variable called ids. val ids: SCollection[String] = ~ ids.saveAsTextFile(pathToGCS) When I save it to Cloud Storage, the contents of the text file are a…
user39613
  • 21
  • 2
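
An SCollection is a deferred, distributed value, so there is no direct conversion while the pipeline is under construction; within a pipeline the idiomatic bridge is a side input. A sketch with assumed names:

```scala
import com.spotify.scio.values.SCollection

def tagWithIds(ids: SCollection[String],
               events: SCollection[String]): SCollection[String] = {
  val idsSide = ids.asListSideInput // becomes an in-memory Seq per worker

  events
    .withSideInputs(idsSide)
    .map { (event, ctx) =>
      val idList: Seq[String] = ctx(idsSide) // the Seq[String] you wanted
      if (idList.contains(event)) s"known:$event" else s"unknown:$event"
    }
    .toSCollection
}
```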
2
votes
1 answer

Beam SQL Not Firing

I am building a simple prototype wherein I am reading data from Pub/Sub and using Beam SQL; the code snippet is below: val eventStream: SCollection[String] = sc.pubsubSubscription[String]("projects/jayadeep-etl-platform/subscriptions/orders-dataflow") …
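
A common reason a Beam SQL aggregation never fires on a Pub/Sub stream is an unwindowed GROUP BY: over unbounded input it never knows when to emit. One hedged fix is windowing inside the SQL itself (column names are made up):

```scala
// TUMBLE assigns 1-minute fixed windows, giving the GROUP BY a point
// at which each pane can fire; event_ts stands for the row's
// event-time column (name assumed)
val query =
  """SELECT customerId, COUNT(*) AS orders
    |FROM PCOLLECTION
    |GROUP BY customerId, TUMBLE(event_ts, INTERVAL '1' MINUTE)""".stripMargin
// handed to Beam's SqlTransform.query(query)
```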
2
votes
2 answers

Custom timestamp and windowing for Pub/Sub in DataFlow (Apache Beam)

I want to implement the following scenario using a streaming pipeline in Apache Beam (and running it on Google Dataflow): read messages from Pub/Sub (JSON strings); deserialize the JSONs; use a custom field (say timeStamp) as the timestamp value for the…
Marcin Zablocki
  • 10,171
  • 1
  • 37
  • 47
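
Scio supports both routes: pass `timestampAttribute` to the Pub/Sub read so messages are stamped on ingestion, or re-stamp after deserializing with `timestampBy`; a sketch of the latter with an assumed event shape:

```scala
import com.spotify.scio.values.SCollection
import org.joda.time.{Duration, Instant}

final case class Event(timeStamp: Long, payload: String) // assumed shape

def windowByEventTime(events: SCollection[Event]): SCollection[Event] =
  events
    // move elements onto their embedded event time; skew allowance is
    // needed because elements arrive after the timestamps they carry
    .timestampBy(e => new Instant(e.timeStamp),
                 allowedTimestampSkew = Duration.standardHours(1))
    .withFixedWindows(Duration.standardMinutes(10))
```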