Questions tagged [dstream]

Discretized Streams (D-Stream) is an approach that handles streaming computations as a series of deterministic batch computations on small time intervals. The input data received during each interval is stored reliably across the cluster to form an input dataset for that interval. Once the time interval completes, this dataset is processed via deterministic parallel operations, such as map, reduce, and groupBy, to produce new datasets representing program outputs or intermediate state.
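
For example, a minimal Spark Streaming word count (a sketch; the socket source, host, and port are illustrative assumptions) shows the model: each interval's input becomes one RDD, and deterministic operations run on it per batch.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // A minimal sketch: one deterministic batch computation per 10-second interval.
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamExample")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Each 10-second interval of socket input becomes one RDD in the DStream.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)   // deterministic parallel operation per batch
    counts.print()

    ssc.start()
    ssc.awaitTermination()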

109 questions
7
votes
2 answers

For each RDD in a DStream, how do I convert it to an array or some other typical Java data type?

I would like to convert a DStream into an array, list, etc. so I can then translate it to JSON and serve it on an endpoint. I'm using Apache Spark, ingesting Twitter data. How do I perform this operation on the DStream statuses? I can't seem to get…
CodingIsAwesome
  • 1,946
  • 7
  • 36
  • 54
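
A common pattern for this kind of question (a sketch; statuses and the JSON step are assumptions, not the asker's actual code) is to collect each batch's RDD on the driver inside foreachRDD:

    import org.apache.spark.streaming.dstream.DStream

    // Sketch: collect each micro-batch to the driver as a plain array.
    // `statuses` stands in for the asker's DStream[String] of tweet texts.
    def serveBatches(statuses: DStream[String]): Unit = {
      statuses.foreachRDD { rdd =>
        val batch: Array[String] = rdd.collect() // pulls this interval's data to the driver
        // `batch` can now be converted to JSON and pushed to an endpoint
        println(batch.mkString("[", ",", "]"))
      }
    }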
6
votes
0 answers

Spark Streaming error during job runtime in cluster (YARN resource manager)

I am facing the following error: I wrote an application based on Spark Streaming (DStream) to pull messages coming from PubSub. Unfortunately, I am facing errors during the execution of this job. I am using a cluster composed of 4…
scalacode
  • 1,096
  • 1
  • 16
  • 38
5
votes
1 answer

Concurrent transformations on an RDD in the foreachRDD function of a Spark DStream

In the following code, it appears that the functions fn1 and fn2 are applied to inRDD sequentially, as I see in the Stages section of the Spark Web UI. DstreamRDD1.foreachRDD(new VoidFunction>() { public void…
darkknight444
  • 546
  • 8
  • 21
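
Spark submits jobs from a single thread sequentially by default; a common workaround (a sketch, with placeholder functions standing in for the question's fn1 and fn2) is to submit the two jobs from separate threads:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._
    import org.apache.spark.streaming.dstream.DStream

    // Sketch: run two independent jobs on the same batch RDD concurrently.
    def processConcurrently(stream: DStream[String]): Unit = {
      stream.foreachRDD { rdd =>
        rdd.cache() // avoid recomputing the batch for the second job
        val fn1 = (s: String) => s.length      // placeholder for the question's fn1
        val fn2 = (s: String) => s.toUpperCase // placeholder for the question's fn2
        // Jobs submitted from different threads can be scheduled concurrently,
        // subject to the scheduler pool and available executors.
        val job1 = Future { rdd.map(fn1).count() }
        val job2 = Future { rdd.map(fn2).count() }
        Await.result(job1, 10.minutes)
        Await.result(job2, 10.minutes)
        rdd.unpersist()
      }
    }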
5
votes
0 answers

Unable to create a DataFrame from a JSON DStream using PySpark

I am attempting to create a DataFrame from JSON in a DStream, but the code below does not seem to get the DataFrame right - import sys import json from pyspark import SparkContext from pyspark.streaming import StreamingContext from pyspark.sql import…
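
The usual shape of the fix (sketched here in Scala, though the question itself is PySpark) is to convert each batch RDD inside foreachRDD, where a SparkSession is available:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.dstream.DStream

    // Sketch: turn each micro-batch of JSON strings into a DataFrame.
    def jsonBatchesToDataFrames(jsonStream: DStream[String]): Unit = {
      jsonStream.foreachRDD { rdd =>
        if (!rdd.isEmpty()) {
          val spark = SparkSession.builder().getOrCreate()
          import spark.implicits._
          // Spark 2.x: read a Dataset[String] of JSON records as a DataFrame.
          val df = spark.read.json(rdd.toDS())
          df.show()
        }
      }
    }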
5
votes
1 answer

How to carry data streams over multiple batch intervals in Spark Streaming

I'm using Apache Spark Streaming 1.6.1 to write a Java application that joins two Key/Value data streams and writes the output to HDFS. The two data streams contain K/V strings and are periodically ingested into Spark from HDFS by using…
Marco
  • 180
  • 1
  • 8
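
The standard way to carry key/value state across batch intervals (a sketch using updateStateByKey; mapWithState is the newer alternative) looks like this:

    import org.apache.spark.streaming.dstream.DStream

    // Sketch: accumulate a running count per key across batches.
    // Requires ssc.checkpoint(...) to be set, since the state is checkpointed.
    def runningCounts(pairs: DStream[(String, Int)]): DStream[(String, Int)] =
      pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
        Some(newValues.sum + state.getOrElse(0))
      }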
4
votes
1 answer

Transformed DStream in PySpark gives an error when pprint is called on it

I'm exploring Spark Streaming through PySpark, and hitting an error when I try to use the transform function with take. I can successfully use sortBy against the DStream via transform and pprint the result. author_counts_sorted_dstream =…
Robin Moffatt
  • 30,382
  • 3
  • 65
  • 92
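
The root cause in cases like this is usually that the function passed to transform must return an RDD, while take returns a plain collection. Sketched in Scala (the question is PySpark), the working pattern is:

    import org.apache.spark.streaming.dstream.DStream

    // Sketch: sort inside transform (which must return an RDD),
    // then let print()/pprint() take the first few elements.
    def topAuthors(counts: DStream[(String, Int)]): Unit = {
      val sorted = counts.transform(rdd => rdd.sortBy(_._2, ascending = false))
      sorted.print(5) // prints the first 5 elements of each batch; take() here would fail
    }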
4
votes
1 answer

In Spark Streaming, what is the difference between foreach and foreachRDD?

For example, how would x.foreach(rdd => rdd.cache()) be different from x.foreachRDD(rdd => rdd.cache()) Note that x is a DStream here.
pythonic
  • 20,589
  • 43
  • 136
  • 219
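
Short answer, hedged: in Spark 1.x, DStream.foreach was a deprecated alias for foreachRDD, so both calls in the question receive the batch RDD and behave the same; a sketch:

    import org.apache.spark.streaming.dstream.DStream

    // Sketch: both calls operate on the per-batch RDD, not on individual elements.
    def cacheEachBatch(x: DStream[String]): Unit = {
      x.foreachRDD(rdd => rdd.cache())  // canonical name
      // x.foreach(rdd => rdd.cache()) // old deprecated alias of foreachRDD
      // To act per element, call rdd.foreach(...) inside foreachRDD instead.
    }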
4
votes
1 answer

Mock input DStream in Apache Spark

I am trying to mock the input DStream while writing a Spark Streaming unit test. I am able to mock the RDDs, but when I try to convert them into a DStream, the DStream object comes up empty. I have used the following code- val lines =…
Y0gesh Gupta
  • 2,184
  • 5
  • 40
  • 56
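
A common way to feed test data into a DStream without mocking at all (a sketch using StreamingContext.queueStream) is:

    import scala.collection.mutable
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Sketch: build a DStream from pre-made RDDs, one per batch, for unit tests.
    val conf = new SparkConf().setMaster("local[2]").setAppName("dstream-test")
    val ssc = new StreamingContext(conf, Seconds(1))

    val rddQueue = mutable.Queue(ssc.sparkContext.parallelize(Seq("line one", "line two")))
    val lines = ssc.queueStream(rddQueue, oneAtATime = true)
    lines.print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // let a few batches run, then stop
    ssc.stop(stopSparkContext = true)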
4
votes
2 answers

Collect results from RDDs in a DStream in the driver program

I have this function in the driver program which collects the results from RDDs into an array and sends them back. However, even though the RDDs (in the DStream) have data, the function is returning an empty array... What am I doing wrong? def…
user2888475
  • 63
  • 1
  • 4
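
A frequent pitfall behind empty results like this: foreachRDD only registers work that runs after ssc.start(), so an array populated inside it is still empty when the surrounding function returns. A sketch of a pattern that does observe the data:

    import org.apache.spark.streaming.dstream.DStream

    // Sketch: results only materialize per batch, after ssc.start().
    // Returning an array filled inside foreachRDD before then yields an empty array.
    def collectPerBatch(stream: DStream[Int]): Unit = {
      stream.foreachRDD { rdd =>
        val batch = rdd.collect() // runs once per interval, on the driver
        println(s"batch of ${batch.length} elements: ${batch.mkString(",")}")
      }
    }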
3
votes
2 answers

Unable to manually commit offsets in a Kafka direct stream, Spark Streaming

I am trying to verify that manual offset commits work. When I try to exit the job by using thread.sleep()/jssc.stop()/throwing exceptions in the while loop, I see offsets being committed. I am just sending a couple of messages in order…
voldy
  • 359
  • 1
  • 8
  • 21
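
For the kafka-0-10 direct stream, the documented manual-commit pattern is to grab the batch's offset ranges and commit them after processing (a sketch; note that commitAsync is not immediate, so stopping the job right away can leave commits unflushed):

    import org.apache.kafka.clients.consumer.ConsumerRecord
    import org.apache.spark.streaming.dstream.InputDStream
    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    // Sketch: commit offsets manually once a batch has been processed.
    def processAndCommit(stream: InputDStream[ConsumerRecord[String, String]]): Unit = {
      stream.foreachRDD { rdd =>
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        rdd.foreach(record => println(record.value())) // stand-in for real processing
        // commitAsync is queued and sent on a later poll, not synchronously.
        stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
      }
    }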
3
votes
0 answers

Joining a DStream and RDD with checkpointing

I've been battling to perform a join between a DStream and an RDD. To set the scene: Spark 2.3.1, Python 3.6.3. I'm reading the RDD in from a CSV file, splitting the records, and producing a pair RDD. sku_prices =…
datawookie
  • 1,607
  • 12
  • 20
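
The usual way to join a DStream with a static RDD (sketched in Scala; the question is PySpark, and the names below are illustrative) is through transform, which exposes each batch RDD; with checkpoint recovery, the static RDD reference held in the closure is a known pain point:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    // Sketch: per-batch join of a pair DStream against a static pair RDD.
    def joinWithPrices(stream: DStream[(String, Int)],
                       skuPrices: RDD[(String, Double)]): DStream[(String, (Int, Double))] =
      stream.transform(batchRdd => batchRdd.join(skuPrices))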
3
votes
1 answer

Scala to Pyspark

I am trying to perform a join between a DStream and a static RDD. PySpark: #Create static data ip_classification_rdd = sc.parallelize([('log_name','enrichment_success')]) #Broadcast it to all nodes ip_classification_rdd_broadcast =…
steven
  • 644
  • 1
  • 11
  • 23
3
votes
1 answer

How to use feature extraction with DStream in Apache Spark

I have data arriving from Kafka through a DStream. I want to perform feature extraction on it in order to obtain some keywords. I do not want to wait for the arrival of all the data (as it is intended to be a continuous stream that potentially never ends), so I…
Mateusz Kubuszok
  • 24,995
  • 4
  • 42
  • 64
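
One way to extract features continuously, batch by batch (a sketch using mllib's HashingTF; the asker's actual pipeline is not shown), is to apply the transformer inside transform:

    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.streaming.dstream.DStream

    // Sketch: hash each batch's token sequences into TF feature vectors
    // without waiting for the stream to end.
    def extractFeatures(tokens: DStream[Seq[String]]): DStream[Vector] = {
      val tf = new HashingTF(numFeatures = 1 << 18)
      tokens.transform(rdd => tf.transform(rdd))
    }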
3
votes
3 answers

Does a DStream's RDD pull the entire data created for the batch interval in one shot?

I have gone through this Stack Overflow question; as per the answer, it creates a DStream with only one RDD for the batch interval. For example: my batch interval is 1 minute, and the Spark Streaming job is consuming data from a Kafka topic. My question is,…
Shankar
  • 8,529
  • 26
  • 90
  • 159
3
votes
1 answer

Spark Streaming if(!rdd.partitions.isEmpty) not working

I'm trying to create a DStream from a Kafka server and then do some transformations on that stream. I have included a check for when the stream is empty (if(!rdd.partitions.isEmpty)); however, even when no events are being published to the Kafka…
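
The likely cause, hedged: the direct Kafka stream's batch RDDs keep one Spark partition per Kafka topic partition even when no records arrived, so rdd.partitions is never empty; rdd.isEmpty() checks for actual records. A sketch:

    import org.apache.spark.streaming.dstream.DStream

    // Sketch: a Kafka batch RDD has one Spark partition per Kafka partition,
    // so rdd.partitions is non-empty even for batches with no records.
    def processNonEmptyBatches(stream: DStream[String]): Unit = {
      stream.foreachRDD { rdd =>
        if (!rdd.isEmpty()) {   // checks for records, not partitions
          rdd.foreach(println)  // stand-in for the real transformations
        }
      }
    }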