Questions tagged [dstream]

Discretized Streams (D-Stream) is an approach that handles streaming computations as a series of deterministic batch computations on small time intervals. The input data received during each interval is stored reliably across the cluster to form an input dataset for that interval. Once the time interval completes, this dataset is processed via deterministic parallel operations, such as map, reduce, and groupBy, to produce new datasets representing program outputs or intermediate state.
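
For example, a minimal Spark Streaming word count (a sketch; the socket source, host, and port are illustrative assumptions) shows the model: each interval's input becomes one RDD, and deterministic operations run on it per batch.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // A minimal sketch: one deterministic batch computation per 10-second interval.
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamExample")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Each 10-second interval of socket input becomes one RDD in the DStream.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)   // deterministic parallel operation per batch
    counts.print()

    ssc.start()
    ssc.awaitTermination()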

109 questions
7
votes
2 answers

For each RDD in a DStream, how do I convert it to an array or some other typical Java data type?

I would like to convert a DStream into an array, list, etc. so I can then translate it to JSON and serve it on an endpoint. I'm using Apache Spark, ingesting Twitter data. How do I perform this operation on the DStream statuses? I can't seem to get…
CodingIsAwesome
  • 1,946
  • 7
  • 36
  • 54
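
A common pattern for this kind of question (a sketch; statuses and the JSON step are assumptions, not the asker's actual code) is to collect each batch's RDD on the driver inside foreachRDD:

    import org.apache.spark.streaming.dstream.DStream

    // Sketch: collect each micro-batch to the driver as a plain array.
    // `statuses` stands in for the asker's DStream[String] of tweet texts.
    def serveBatches(statuses: DStream[String]): Unit = {
      statuses.foreachRDD { rdd =>
        val batch: Array[String] = rdd.collect() // pulls this interval's data to the driver
        // `batch` can now be converted to JSON and pushed to an endpoint
        println(batch.mkString("[", ",", "]"))
      }
    }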
6
votes
0 answers

Spark Streaming error during job runtime in cluster (YARN resource manager)

I am facing the following error: I wrote an application based on Spark Streaming (DStream) to pull messages coming from PubSub. Unfortunately, I am facing errors during the execution of this job. I am using a cluster composed of 4…
scalacode
  • 1,096
  • 1
  • 16
  • 38
5
votes
1 answer

Concurrent transformations on an RDD in the foreachRDD function of a Spark DStream

In the following code, it appears that the functions fn1 and fn2 are applied to inRDD sequentially, as I see in the Stages section of the Spark Web UI. DstreamRDD1.foreachRDD(new VoidFunction>() { public void…
darkknight444
  • 546
  • 8
  • 21
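
Spark submits jobs from a single thread sequentially by default; a common workaround (a sketch, with placeholder functions standing in for the question's fn1 and fn2) is to submit the two jobs from separate threads:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._
    import org.apache.spark.streaming.dstream.DStream

    // Sketch: run two independent jobs on the same batch RDD concurrently.
    def processConcurrently(stream: DStream[String]): Unit = {
      stream.foreachRDD { rdd =>
        rdd.cache() // avoid recomputing the batch for the second job
        val fn1 = (s: String) => s.length      // placeholder for the question's fn1
        val fn2 = (s: String) => s.toUpperCase // placeholder for the question's fn2
        // Jobs submitted from different threads can be scheduled concurrently,
        // subject to the scheduler pool and available executors.
        val job1 = Future { rdd.map(fn1).count() }
        val job2 = Future { rdd.map(fn2).count() }
        Await.result(job1, 10.minutes)
        Await.result(job2, 10.minutes)
        rdd.unpersist()
      }
    }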
5
votes
0 answers

Unable to create a DataFrame from a JSON DStream using PySpark

I am attempting to create a DataFrame from JSON in a DStream, but the code below does not seem to get the DataFrame right - import sys import json from pyspark import SparkContext from pyspark.streaming import StreamingContext from pyspark.sql import…
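
The usual shape of the fix (sketched here in Scala, though the question itself is PySpark) is to convert each batch RDD inside foreachRDD, where a SparkSession is available:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.dstream.DStream

    // Sketch: turn each micro-batch of JSON strings into a DataFrame.
    def jsonBatchesToDataFrames(jsonStream: DStream[String]): Unit = {
      jsonStream.foreachRDD { rdd =>
        if (!rdd.isEmpty()) {
          val spark = SparkSession.builder().getOrCreate()
          import spark.implicits._
          // Spark 2.x: read a Dataset[String] of JSON records as a DataFrame.
          val df = spark.read.json(rdd.toDS())
          df.show()
        }
      }
    }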
5
votes
1 answer

How to carry data streams over multiple batch intervals in Spark Streaming

I'm using Apache Spark Streaming 1.6.1 to write a Java application that joins two Key/Value data streams and writes the output to HDFS. The two data streams contain K/V strings and are periodically ingested into Spark from HDFS by using…
Marco
  • 180
  • 1
  • 8
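
The standard way to carry key/value state across batch intervals (a sketch using updateStateByKey; mapWithState is the newer alternative) looks like this:

    import org.apache.spark.streaming.dstream.DStream

    // Sketch: accumulate a running count per key across batches.
    // Requires ssc.checkpoint(...) to be set, since the state is checkpointed.
    def runningCounts(pairs: DStream[(String, Int)]): DStream[(String, Int)] =
      pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
        Some(newValues.sum + state.getOrElse(0))
      }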
4
votes
1 answer

Transformed DStream in PySpark gives an error when pprint is called on it

I'm exploring Spark Streaming through PySpark, and hitting an error when I try to use the transform function with take. I can successfully use sortBy against the DStream via transform and pprint the result. author_counts_sorted_dstream =…
Robin Moffatt
  • 30,382
  • 3
  • 65
  • 92
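
The root cause in cases like this is usually that the function passed to transform must return an RDD, while take returns a plain collection. Sketched in Scala (the question is PySpark), the working pattern is:

    import org.apache.spark.streaming.dstream.DStream

    // Sketch: sort inside transform (which must return an RDD),
    // then let print()/pprint() take the first few elements.
    def topAuthors(counts: DStream[(String, Int)]): Unit = {
      val sorted = counts.transform(rdd => rdd.sortBy(_._2, ascending = false))
      sorted.print(5) // prints the first 5 elements of each batch; take() here would fail
    }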
4
votes
1 answer

In Spark Streaming, what is the difference between foreach and foreachRDD?

For example, how would x.foreach(rdd => rdd.cache()) be different from x.foreachRDD(rdd => rdd.cache()) Note that x is a DStream here.
pythonic
  • 20,589
  • 43
  • 136
  • 219
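
Short answer, hedged: in Spark 1.x, DStream.foreach was a deprecated alias for foreachRDD, so both calls in the question receive the batch RDD and behave the same; a sketch:

    import org.apache.spark.streaming.dstream.DStream

    // Sketch: both calls operate on the per-batch RDD, not on individual elements.
    def cacheEachBatch(x: DStream[String]): Unit = {
      x.foreachRDD(rdd => rdd.cache())  // canonical name
      // x.foreach(rdd => rdd.cache()) // old deprecated alias of foreachRDD
      // To act per element, call rdd.foreach(...) inside foreachRDD instead.
    }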
4
votes
1 answer

Mock input DStream in Apache Spark

I am trying to mock the input DStream while writing a Spark Streaming unit test. I am able to mock the RDDs, but when I try to convert them into a DStream, the DStream object comes up empty. I have used the following code- val lines =…
Y0gesh Gupta
  • 2,184
  • 5
  • 40
  • 56
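
A common way to feed test data into a DStream without mocking at all (a sketch using StreamingContext.queueStream) is:

    import scala.collection.mutable
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Sketch: build a DStream from pre-made RDDs, one per batch, for unit tests.
    val conf = new SparkConf().setMaster("local[2]").setAppName("dstream-test")
    val ssc = new StreamingContext(conf, Seconds(1))

    val rddQueue = mutable.Queue(ssc.sparkContext.parallelize(Seq("line one", "line two")))
    val lines = ssc.queueStream(rddQueue, oneAtATime = true)
    lines.print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // let a few batches run, then stop
    ssc.stop(stopSparkContext = true)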
4
votes
2 answers

Collect results from RDDs in a DStream in the driver program

I have this function in the driver program which collects the results from RDDs into an array and sends them back. However, even though the RDDs (in the DStream) have data, the function is returning an empty array... What am I doing wrong? def…
user2888475
  • 63
  • 1
  • 4
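
A frequent pitfall behind empty results like this: foreachRDD only registers work that runs after ssc.start(), so an array populated inside it is still empty when the surrounding function returns. A sketch of a pattern that does observe the data:

    import org.apache.spark.streaming.dstream.DStream

    // Sketch: results only materialize per batch, after ssc.start().
    // Returning an array filled inside foreachRDD before then yields an empty array.
    def collectPerBatch(stream: DStream[Int]): Unit = {
      stream.foreachRDD { rdd =>
        val batch = rdd.collect() // runs once per interval, on the driver
        println(s"batch of ${batch.length} elements: ${batch.mkString(",")}")
      }
    }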
3
votes
2 answers

Unable to manually commit offsets in a Kafka direct stream, Spark Streaming

I am trying to verify that manual offset commits work. When I try to exit the job by using thread.sleep()/jssc.stop()/throwing exceptions in the while loop, I see offsets being committed. I am just sending a couple of messages in order…
voldy
  • 359
  • 1
  • 8
  • 21
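
For the kafka-0-10 direct stream, the documented manual-commit pattern is to grab the batch's offset ranges and commit them after processing (a sketch; note that commitAsync is not immediate, so stopping the job right away can leave commits unflushed):

    import org.apache.kafka.clients.consumer.ConsumerRecord
    import org.apache.spark.streaming.dstream.InputDStream
    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    // Sketch: commit offsets manually once a batch has been processed.
    def processAndCommit(stream: InputDStream[ConsumerRecord[String, String]]): Unit = {
      stream.foreachRDD { rdd =>
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        rdd.foreach(record => println(record.value())) // stand-in for real processing
        // commitAsync is queued and sent on a later poll, not synchronously.
        stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
      }
    }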
3
votes
0 answers

Joining a DStream and RDD with checkpointing

I've been battling to perform a join between a DStream and an RDD. To set the scene: Spark 2.3.1, Python 3.6.3. I'm reading the RDD in from a CSV file, splitting the records, and producing a pair RDD. sku_prices =…
datawookie
  • 1,607
  • 12
  • 20
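
The usual way to join a DStream with a static RDD (sketched in Scala; the question is PySpark, and the names below are illustrative) is through transform, which exposes each batch RDD; with checkpoint recovery, the static RDD reference held in the closure is a known pain point:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    // Sketch: per-batch join of a pair DStream against a static pair RDD.
    def joinWithPrices(stream: DStream[(String, Int)],
                       skuPrices: RDD[(String, Double)]): DStream[(String, (Int, Double))] =
      stream.transform(batchRdd => batchRdd.join(skuPrices))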
3
votes
1 answer

Scala to Pyspark

I am trying to perform a join between a DStream and a static RDD. PySpark: #Create static data ip_classification_rdd = sc.parallelize([('log_name','enrichment_success')]) #Broadcast it to all nodes ip_classification_rdd_broadcast =…
steven
  • 644
  • 1
  • 11
  • 23
3
votes
1 answer

How to use feature extraction with DStream in Apache Spark

I have data arriving from Kafka through a DStream. I want to perform feature extraction on it in order to obtain some keywords. I do not want to wait for the arrival of all the data (as it is intended to be a continuous stream that potentially never ends), so I…
Mateusz Kubuszok
  • 24,995
  • 4
  • 42
  • 64
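
One way to extract features continuously, batch by batch (a sketch using mllib's HashingTF; the asker's actual pipeline is not shown), is to apply the transformer inside transform:

    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.streaming.dstream.DStream

    // Sketch: hash each batch's token sequences into TF feature vectors
    // without waiting for the stream to end.
    def extractFeatures(tokens: DStream[Seq[String]]): DStream[Vector] = {
      val tf = new HashingTF(numFeatures = 1 << 18)
      tokens.transform(rdd => tf.transform(rdd))
    }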
3
votes
3 answers

Does a DStream's RDD pull the entire data created for the batch interval in one shot?

I have gone through this Stack Overflow question; as per the answer, it creates a DStream with only one RDD for the batch interval. For example: my batch interval is 1 minute, and the Spark Streaming job is consuming data from a Kafka topic. My question is,…
Shankar
  • 8,529
  • 26
  • 90
  • 159
3
votes
1 answer

Spark Streaming if(!rdd.partitions.isEmpty) not working

I'm trying to create a DStream from a Kafka server and then do some transformations on that stream. I have included a check for when the stream is empty (if(!rdd.partitions.isEmpty)); however, even when no events are being published to the Kafka…
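
The likely cause, hedged: the direct Kafka stream's batch RDDs keep one Spark partition per Kafka topic partition even when no records arrived, so rdd.partitions is never empty; rdd.isEmpty() checks for actual records. A sketch:

    import org.apache.spark.streaming.dstream.DStream

    // Sketch: a Kafka batch RDD has one Spark partition per Kafka partition,
    // so rdd.partitions is non-empty even for batches with no records.
    def processNonEmptyBatches(stream: DStream[String]): Unit = {
      stream.foreachRDD { rdd =>
        if (!rdd.isEmpty()) {   // checks for records, not partitions
          rdd.foreach(println)  // stand-in for the real transformations
        }
      }
    }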