
I am new to Spark, and I am trying to write some example code based on Spark and Spark Streaming.

So far, I have implemented a sorting function in Spark; here is the code:

  def sort(listSize: Int, slice: Int): Unit = {
    val conf = new SparkConf().setAppName(getClass.getName)
    val spark = new SparkContext(conf)
    val data = genRandom(listSize)
    val distData = spark.parallelize(data, slice)
    val result = distData.sortBy(x => x, ascending = true)
    val finalResult = result.collect()
    printlnArray(finalResult, 0, 10)
    spark.stop()
  }

  /**
   * Generates a list of random integers in [0, 100000).
   * @param listSize number of values to generate
   * @return the generated list
   */
  def genRandom(listSize: Int): List[Int] = {
    val range = 100000
    val listBuffer = new ListBuffer[Int]
    val random = new Random()
    for (i <- 1 to listSize) listBuffer += random.nextInt(range)
    listBuffer.toList
  }

  def printlnArray(list: Array[Int], start: Int, offset: Int): Unit = {
    for (i <- start until start + offset) println(">>>>>>>>> list : " + i + " | " + list(i))
  }

I am having trouble implementing a sort function in Spark Streaming. As far as I know, Spark RDDs provide a sort API in Spark Core, but there is no such API in Spark Streaming. Does anyone know how to do it? Thanks

This is probably a dumb question, but after googling around I did not find a right answer. If anyone knows how to solve it, thanks for your help.

Chan
  • Do you want to sort each `microbatch` of the stream or do you want to sort the whole stream? The latter is - in terms of stream processing in general - afaik not possible. – dwegener Jun 02 '15 at 17:24

1 Answer


You can use the `transform` function of a DStream to apply RDD operations to each of its underlying RDDs.

For instance

  myDStream.transform(rdd => rdd.sortByKey())
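A minimal end-to-end sketch of this approach, assuming a hypothetical stream of integers arriving as text lines on a local socket (the host, port, and batch interval are illustrative). Note that `sortByKey` only exists on key/value DStream contents; for a plain stream of values, `sortBy` matches the question's batch code:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  object StreamingSortExample {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("StreamingSortExample").setMaster("local[2]")
      val ssc = new StreamingContext(conf, Seconds(5))

      // Hypothetical source: one integer per line arriving on a socket.
      val numbers = ssc.socketTextStream("localhost", 9999).map(_.trim.toInt)

      // transform applies an RDD-to-RDD function to every micro-batch,
      // making any RDD API (here sortBy) available on the stream.
      val sortedPerBatch = numbers.transform(rdd => rdd.sortBy(x => x, ascending = true))

      sortedPerBatch.print()

      ssc.start()
      ssc.awaitTermination()
    }
  }

Keep in mind that each micro-batch is sorted independently; sorting the whole unbounded stream is not possible, as the comment under the question points out.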
Marco
  • Thanks @Hawk66! Wanted to know how could we perform singular operations on DStreams? like say, if we want just the topmost entry in each microbatch. `myDStream.transform(rdd=>rdd.sortByKey()).top(1)` OR `myDStream.transform(rdd=>rdd.sortByKey()).transform(rdd=>rdd.top(1))`? Nothing works so far – Dexter Feb 28 '17 at 02:23
  • never mind. Got my answer here in this SO link http://stackoverflow.com/questions/41483746/transformed-dstream-in-pyspark-gives-error-when-pprint-called-on-it – Dexter Feb 28 '17 at 03:33