
I have an RDD of tuples as follows:

(a, 1), (a, 2), (b,1)

How can I get the first two tuples with distinct keys? If I do a take(2), I will get (a, 1) and (a, 2).

What I need is (a, 1), (b, 1): the keys must be distinct. The values are irrelevant.

Saqib Ali

3 Answers


Here's what I threw together in Scala.

sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
    .reduceByKey((v1, v2) => v1) // keep one value per key (the arguments are values, not keys)
    .collect()

Outputs

Array[(String, Int)] = Array((a,1), (b,1))
OneCricketeer
  • Can be simplified - `reduceByKey` can be called on the original tuple RDD, then you won't need the map at the beginning and end: `input.reduceByKey((k1,k2) => k1).take(2)` is enough – Tzach Zohar Aug 01 '16 at 07:17
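The keep-one-value-per-key semantics of `reduceByKey` can be sketched with plain Scala collections, no SparkContext needed. This uses `groupMapReduce` (a Scala 2.13+ collections method, not part of Spark), which folds the values for each key left to right, so `(v1, _) => v1` keeps the first value seen:

```scala
// Plain-Scala sketch of "reduce per key, keep the first value".
val data = Seq(("a", 1), ("a", 2), ("b", 1))
val firstPerKey = data.groupMapReduce(_._1)(_._2)((v1, _) => v1)
// firstPerKey == Map("a" -> 1, "b" -> 1)
println(firstPerKey)
```

One caveat for the Spark version: `reduceByKey` merges values across partitions in no guaranteed order, so "first" is arbitrary there. That is fine here, since the question says the values are irrelevant.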

As you already have an RDD of pairs, your RDD has extra key-value functionality provided by org.apache.spark.rdd.PairRDDFunctions. Let's make use of that.

val pairRdd = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
// RDD[(String, Int)]

val groupedRdd = pairRdd.groupByKey()
// RDD[(String, Iterable[Int])]

val requiredRdd = groupedRdd.map { case (key, iter) => (key, iter.head) }
// RDD[(String, Int)]

Or in short

sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
  .groupByKey()
  .map { case (key, iter) => (key, iter.head) }
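Note that `PairRDDFunctions` also provides `mapValues`, so the last step could be written as `groupedRdd.mapValues(_.head)`, which leaves the keys untouched. The group-then-take-head semantics can be sketched with plain Scala collections (no SparkContext needed); `Seq#groupBy` keeps the values of each group in encounter order:

```scala
// Plain-Scala sketch of "group by key, then keep the first value per group".
val pairs = Seq(("a", 1), ("a", 2), ("b", 1))
val grouped = pairs.groupBy(_._1)                      // Map[String, Seq[(String, Int)]]
val headPerKey = grouped.map { case (k, kvs) => (k, kvs.head._2) }
// headPerKey == Map("a" -> 1, "b" -> 1)
```

On a real RDD, `groupByKey` shuffles every value for a key across the cluster, so `reduceByKey` (as in the accepted answer) is the cheaper choice when you only need one value.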
sarveshseri

It is easy: you just need to use a function like the one below:

val data = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
data.collectAsMap().foreach(println)
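As the comments below note, this does not quite answer the question: building a map keeps only one value per duplicate key (and with Spark's `collectAsMap`, which one survives is not guaranteed), and it pulls the whole RDD onto the driver. The collapsing behaviour can be sketched with plain Scala's `toMap`, where the last value wins:

```scala
// Plain-Scala sketch of the duplicate-key collapsing that a Map performs.
val data = Seq(("a", 1), ("a", 2), ("b", 1))
val asMap = data.toMap
// asMap == Map("a" -> 2, "b" -> 1); the pair ("a", 1) has been overwritten
```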
Lyen
  • If it's an RDD, the collectAsMap may well blow up as the data won't fit on one node. If it fits, there would be no need to use Spark – The Archetypal Paul Aug 01 '16 at 06:47
  • I do not get what result you really need; if you just want to get the distinct keys and ignore the values, there are already answers in the comments – Lyen Aug 01 '16 at 07:01