Sorting in Spark

Question

There are similar questions, such as this and this, but they did not give me enough help. Here is part of my code.

val output = abc.collect()
output.foreach(tup => println(tup._1 + "  " + math.ceil(tup._2 * 1000)/1000))

Here is part of the output.

 5         0.835
 1         0.901
 110       0.797
 7         0.821
 11        0.899
 0         0.871
 32        0.313
 78        0.273
 35698     0.333
 119       0.273

I want the output sorted by the first column. I tried takeOrdered(n), but the result is not what I need: it is sorted, but apparently as strings rather than as numbers. It looks something like this:

 0          0.871
 1          0.901
 10         1.072
 11         0.899
 110        0.797
 111        0.288
 12         0.288
 123        0.273
 14         0.554
 153        0.228

Any help, please?

maybe try using dataframes, see http://stackoverflow.com/questions/30332619/how-to-sort-by-column-in-descending-order-in-spark-sql — maxymoo, Jan 18 '16 at 04:01
What do you need to do once the RDD is sorted? `sortBy` will sort it, as the questions you link to say. It's not clear what your question actually is. — The Archetypal Paul, Jan 18 '16 at 08:05
http://stackoverflow.com/questions/26387753/how-to-reverse-ordering-for-rdd-takeordered You can adjust the ordering function to your needs. In this case, if I understand correctly - convert string to int and order by it. — Tom Ron, Jan 18 '16 at 08:56
@TomRon, I am getting type mismatch error after adding (Ordering[Int]) — Asmat Ali, Jan 18 '16 at 09:34
Because you have strings, you need to define another ordering, something like _.toInt < _.toInt. See http://www.scala-lang.org/api/2.11.4/index.html#scala.math.Ordering – Tom Ron, Jan 18 '16 at 12:56
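
The difference between string and numeric ordering described in these comments can be sketched as follows. This is a minimal illustration, not the asker's actual code: it uses plain Scala collections in place of an RDD so that it runs without Spark, and the object name, the sample rows and the `byNumericKey` ordering are all hypothetical.

```scala
// Plain Scala collections stand in for the RDD so the sketch runs without Spark.
object NumericKeyOrdering {
  // Hypothetical sample: string keys with scores, similar to the question's output.
  val rows: Seq[(String, Double)] =
    Seq(("5", 0.835), ("1", 0.901), ("110", 0.797), ("11", 0.899), ("32", 0.313))

  // Sorting on the String key compares character by character,
  // so "110" comes before "32".
  val lexicographic: Seq[(String, Double)] = rows.sortBy(_._1)

  // Explicit ordering: compare rows by the integer value of the key.
  val byNumericKey: Ordering[(String, Double)] =
    Ordering.by[(String, Double), Int](_._1.toInt)

  val numeric: Seq[(String, Double)] = rows.sorted(byNumericKey)

  def main(args: Array[String]): Unit = {
    println(lexicographic.map(_._1).mkString(" ")) // 1 11 110 32 5
    println(numeric.map(_._1).mkString(" "))       // 1 5 11 32 110
  }
}
```

On an RDD of (String, Double) pairs, the same `Ordering` could be passed explicitly in the second parameter list, as in `abc.takeOrdered(n)(byNumericKey)`, assuming every key parses as an integer.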

score 1 · Answer 1 · answered Jan 18 '16 at 15:58

The problem came from combining takeOrdered(n) with collect(). Using takeOrdered on its own, as in val output = abc.takeOrdered(10000), worked perfectly.

Asmat Ali

score 0 · Answer 2 · answered Jan 18 '16 at 06:18

Instead of first collecting the output from the RDD and then trying to sort it, you could convert the keys to integers, sort the RDD, and then create the output:

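// Sample RDD with string keys (sc is the SparkContext)
val abc = sc.parallelize(Array(("5", 0.835), ("1", 0.901), ("110", 0.797)))
// Convert each key to Int so that tuples are ordered numerically, then take the 3 smallest
abc.map { case (k, v) => (k.toInt, v) }.takeOrdered(3).foreach(println(_))
//(1,0.901)
//(5,0.835)
//(110,0.797)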

Christian Hirsch

This could be fine, but the output shown is just a sample. The actual output is huge. So instead of Array(("5",0.835),("1",0.901),("110",0.797)), what general statement could be applied to all the elements? – Asmat Ali Jan 18 '16 at 07:06
@Asmat Ali: Could you elaborate on your comment a little? What kind of problems would appear if you applied the above code to the data set you have in mind? – Christian Hirsch Jan 18 '16 at 07:12
The problem is that I cannot even apply this code to my data set, because it has millions of nodes and I cannot list each node by hand the way you have done for nodes 5, 1 and 110 with ("5",0.835),("1",0.901),("110",0.797). – Asmat Ali Jan 18 '16 at 07:26
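
The concern in this last comment can be addressed with a small sketch: the map call in the answer is written once and applied to every element, so nothing has to be listed by hand. The example below is an assumption-laden illustration rather than the asker's pipeline: a generated plain Scala collection (hypothetical `raw`) stands in for the large RDD so that the code runs without Spark.

```scala
// Plain Scala collections stand in for the RDD so the sketch runs without Spark.
object ConvertAllKeys {
  // Hypothetical stand-in for a large data set: keys arrive as strings,
  // generated here in descending order.
  val raw: Seq[(String, Double)] =
    (0 until 100000).reverse.map(i => (i.toString, i / 100000.0))

  // One map call converts every key; no element is written out by hand.
  val converted: Seq[(Int, Double)] = raw.map { case (k, v) => (k.toInt, v) }

  // The three smallest keys, analogous to takeOrdered(3) on an RDD.
  val firstThree: Seq[(Int, Double)] = converted.sortBy(_._1).take(3)

  def main(args: Array[String]): Unit = firstThree.foreach(println)
}
```

With Spark, the existing RDD takes the place of `raw`, for example `abc.map { case (k, v) => (k.toInt, v) }.takeOrdered(n)`; the Array passed to sc.parallelize in the answer only creates sample data.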
