Questions tagged [direct-runner]

15 questions
3
votes
2 answers

Way to visualize Beam pipeline run with DirectRunner

In GCP we can see the pipeline execution graph. Is the same possible when running locally via DirectRunner?
Dimon Buzz
3
votes
1 answer

Apache Beam with DirectRunner (SUBPROCESS_SDK) uses only one worker, how do I force it to use all available workers?

The following code: def get_pipeline(workers): pipeline_options = PipelineOptions(['--direct_num_workers', str(workers)]) return beam.Pipeline(options=pipeline_options, runner=fn_api_runner.FnApiRunner( …
Chris Su
1
vote
0 answers

Beam Python with Pub/Sub subscription: error with DirectRunner but not DataflowRunner

I have a very simple Beam Python script that works like a charm when started on the DataflowRunner. It takes data from a Pub/Sub subscription and prints it, and that is all. But when I start it on the DirectRunner, I get this error…
1
vote
0 answers

Memory profiling in DirectRunner spark mode

I am doing memory profiling with YourKit and, to simplify matters for a Spark application, I am running the app in DirectRunner mode. The machine I am testing on has 32 cores. The captured snapshot looks like: The "direct-runner-worker" has 32…
A_G
1
vote
1 answer

Java - Apache Beam - GCP: GroupByKey works fine with Direct Runner but fails with Dataflow runner

I tested my code with a Dataflow runner, however it returns an error: > Error message from worker: java.lang.RuntimeException: > org.apache.beam.sdk.util.UserCodeException: > com.fasterxml.jackson.core.JsonParseException: Unrecognized token >…
1
vote
2 answers

Writing to a File in Apache Beam

I am running the WordCount program on Windows using Apache Beam via DirectRunner. I can see the output files being created in a temp folder (under src/main/resources/), but the write to the output file fails. Below is the code…
1
vote
1 answer

How to debug Dataflow/Apache Beam pipeline DoFn functions in eclipse using direct runner

I want to run my pipeline using the direct runner in Eclipse, put a breakpoint in my DoFn functions, and debug execution. I tried to set up the direct runner with the following steps: add the direct runner Maven package; set up a Maven profile for the direct runner in…
PUG
0
votes
0 answers

Apache Beam Python > 2.38.0 DirectRunner ~ AssertionError: A total of N watermark-pending bundles did not execute

Using Python 3.9 and Apache Beam 2.38.0, the minimal working example below works fine. However, when I use Apache Beam 2.39.0 (or 2.44.0), the example fails with the error AssertionError: A total of 2 watermark-pending bundles did not execute. When…
Francis
0
votes
1 answer

ApacheBeam ElasticsearchIO is not working with latest elasticsearch

I've been trying to use the ElasticsearchIO APIs in an Apache Beam pipeline, and I'm unable to connect to Elasticsearch. Any help would be great. My JAR…
Deepak
0
votes
0 answers

MQTTIO Connection with Apache Beam behaves differently for different topics

When I install a Mosquitto broker, publish messages to a topic, subscribe to the messages using an Apache Beam MQTTIO pipeline, and print the messages to the console, I am able to get all the messages. Even after a gap of 5 minutes if I publish…
0
votes
1 answer

ERROR: Could not find a version that satisfies the requirement grpcio<2,>=1.29.0 (from apache-beam[gcp])

I'm encountering an issue while executing an Apache Beam pipeline in Dataflow (using DirectRunner). I have a requirements.txt file containing apache-beam[gcp] among other libraries. The error traceback follows: 2021-08-04 12:08:24.574 | INFO …
0
votes
1 answer

Apache Beam DirectRunner with Cloud Pub/Sub

I am trying to pass data from Cloud Pub/Sub to Google Cloud Storage. When I use the DataflowRunner, the pipeline gets published to Google Cloud Dataflow and works as expected. However, for some testing I'd like the pipeline to run locally (but…
RadRuss
0
votes
0 answers

GCP Dataflow pipeline runs faster in DirectRunner than DataflowRunner

I am quite new to working with Dataflow (GCP). I built a pipeline that runs faster in DirectRunner mode than in DataflowRunner mode, and I don't know how it can be improved. The pipeline reads data from multiple tables in BigQuery and returns a CSV file,…
0
votes
1 answer

Missing options in DirectOptions class

The docs mention the following options: direct_num_workers and direct_running_mode, as well as setting the streaming option. All of these are missing from the DirectOptions class. Also, when trying to set those from args the following exception is…
srfrnk
0
votes
1 answer

Staging Files to GCS using Dataflow DirectRunner

When using DataflowRunner, we stage files to GCS using the filesToStage method; however, this does not happen in DirectRunner. Is there a way to have DirectRunner stage files to GCS and use those files similar to the DataflowRunner perhaps…