I have an RDD of tuples as follows:
(a, 1), (a, 2), (b, 1)
How can I get the first two tuples with distinct keys? If I do a take(2), I will get (a, 1) and (a, 2).
What I need is (a, 1), (b, 1) (the keys are distinct). The values are irrelevant.
Here's what I threw together in Scala.
sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
  .reduceByKey((v1, v2) => v1)  // the function receives two values (not keys); keep the first
  .collect()
Outputs
Array[(String, Int)] = Array((a,1), (b,1))
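The keep-the-first-value logic can be sketched locally without a cluster; in this illustrative sketch, groupBy on a plain Scala Seq stands in for Spark's per-key grouping. One caveat that carries over to the real RDD: Spark does not guarantee the order in which values meet the reduce function across partitions, so "first" is only deterministic within a single partition; that is fine here since the values are irrelevant.

```scala
// Local sketch of reduceByKey((v1, v2) => v1): group the pairs by key,
// then reduce each group's values by keeping the first one seen.
object ReduceByKeySketch {
  def main(args: Array[String]): Unit = {
    val pairs = Seq(("a", 1), ("a", 2), ("b", 1))

    val reduced = pairs
      .groupBy(_._1)  // Map[String, Seq[(String, Int)]], order preserved within groups
      .map { case (k, vs) => (k, vs.map(_._2).reduce((v1, v2) => v1)) }

    println(reduced.toSeq.sorted)
  }
}
```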
As you already have an RDD of pairs, your RDD has extra key-value functionality provided by org.apache.spark.rdd.PairRDDFunctions. Let's make use of that.
val pairRdd = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
// RDD[(String, Int)]
val groupedRdd = pairRdd.groupByKey()
// RDD[(String, Iterable[Int])]
val requiredRdd = groupedRdd.map { case (key, iter) => (key, iter.head) }
// RDD[(String, Int)]
Or, in short:
sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
.groupByKey()
  .map { case (key, iter) => (key, iter.head) }
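Since grouping produces key → values pairs, the per-key head can also be taken with mapValues, which leaves keys untouched and avoids destructuring the tuple by hand. A plain-Scala sketch (Map's `.view.mapValues` standing in for the RDD's mapValues, purely for local illustration):

```scala
// Sketch: after grouping, mapValues transforms only the grouped values,
// so there is no need to pattern-match on (key, iter) pairs.
object MapValuesSketch {
  def main(args: Array[String]): Unit = {
    val grouped = Seq(("a", 1), ("a", 2), ("b", 1)).groupBy(_._1)

    // Keep the first original value for each key.
    val firstPerKey = grouped.view.mapValues(_.head._2).toMap

    println(firstPerKey)
  }
}
```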
It is easy; you just need to use collectAsMap, as below:
val data = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
data.collectAsMap().foreach(println)
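One caveat worth knowing: collectAsMap returns a plain Map, so when a key appears more than once only one value survives per key (the API does not guarantee which one). Since the values are irrelevant here, that is acceptable. A local sketch of similar semantics using Scala's toMap, where the last value for a duplicate key wins:

```scala
// toMap illustrates the duplicate-key collapse: each key keeps exactly
// one value, and for toMap specifically the last value wins.
object CollectAsMapSketch {
  def main(args: Array[String]): Unit = {
    val asMap = Seq(("a", 1), ("a", 2), ("b", 1)).toMap

    println(asMap)  // note a -> 2, not a -> 1: the later value for "a" survived
  }
}
```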