
I have an RDD of tuples as follows:

(a, 1), (a, 2), (b,1)

How can I get the first two tuples with distinct keys? If I do a take(2), I will get (a, 1) and (a, 2).

What I need is (a, 1), (b, 1): the keys must be distinct. The values are irrelevant.

Saqib Ali

3 Answers


Here's what I threw together in Scala.

sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
    .reduceByKey((v1, v2) => v1) // keep one value per key (the arguments are values, not keys)
    .collect()

Outputs

Array[(String, Int)] = Array((a,1), (b,1))
OneCricketeer
  • Can be simplified - `reduceByKey` can be called on the original tuple RDD, then you won't need the map at the beginning and end: `input.reduceByKey((k1,k2) => k1).take(2)` is enough – Tzach Zohar Aug 01 '16 at 07:17
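The keep-one-value-per-key semantics of `reduceByKey` can be sketched with plain Scala collections, no SparkContext needed. This uses `groupMapReduce` (a Scala 2.13+ collections method, not part of Spark), which folds the values for each key left to right, so `(v1, _) => v1` keeps the first value seen:

```scala
// Plain-Scala sketch of "reduce per key, keep the first value".
val data = Seq(("a", 1), ("a", 2), ("b", 1))
val firstPerKey = data.groupMapReduce(_._1)(_._2)((v1, _) => v1)
// firstPerKey == Map("a" -> 1, "b" -> 1)
println(firstPerKey)
```

One caveat for the Spark version: `reduceByKey` merges values across partitions in no guaranteed order, so "first" is arbitrary there. That is fine here, since the question says the values are irrelevant.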

As you already have an RDD of pairs, your RDD has extra key-value functionality provided by org.apache.spark.rdd.PairRDDFunctions. Let's make use of that.

val pairRdd = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
// RDD[(String, Int)]

val groupedRdd = pairRdd.groupByKey()
// RDD[(String, Iterable[Int])]

val requiredRdd = groupedRdd.map { case (key, iter) => (key, iter.head) }
// RDD[(String, Int)]

Or in short

sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
  .groupByKey()
  .map { case (key, iter) => (key, iter.head) }
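Note that `PairRDDFunctions` also provides `mapValues`, so the last step could be written as `groupedRdd.mapValues(_.head)`, which leaves the keys untouched. The group-then-take-head semantics can be sketched with plain Scala collections (no SparkContext needed); `Seq#groupBy` keeps the values of each group in encounter order:

```scala
// Plain-Scala sketch of "group by key, then keep the first value per group".
val pairs = Seq(("a", 1), ("a", 2), ("b", 1))
val grouped = pairs.groupBy(_._1)                      // Map[String, Seq[(String, Int)]]
val headPerKey = grouped.map { case (k, kvs) => (k, kvs.head._2) }
// headPerKey == Map("a" -> 1, "b" -> 1)
```

On a real RDD, `groupByKey` shuffles every value for a key across the cluster, so `reduceByKey` (as in the accepted answer) is the cheaper choice when you only need one value.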
sarveshseri

It is easy: you just need to use a function like the one below:

val data = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
data.collectAsMap().foreach(println)
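As the comments below note, this does not quite answer the question: building a map keeps only one value per duplicate key (and with Spark's `collectAsMap`, which one survives is not guaranteed), and it pulls the whole RDD onto the driver. The collapsing behaviour can be sketched with plain Scala's `toMap`, where the last value wins:

```scala
// Plain-Scala sketch of the duplicate-key collapsing that a Map performs.
val data = Seq(("a", 1), ("a", 2), ("b", 1))
val asMap = data.toMap
// asMap == Map("a" -> 2, "b" -> 1); the pair ("a", 1) has been overwritten
```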
Lyen
  • If it's an RDD, the collectAsMap may well blow up as the data won't fit on one node. If it fits, there would be no need to use Spark – The Archetypal Paul Aug 01 '16 at 06:47
  • I do not get what result you really need; if you just want to get the distinct keys and ignore the values, there are already answers in the comments – Lyen Aug 01 '16 at 07:01