I have the following RDD:

val reducedListOfCalls: RDD[(String, List[Row])]

The RDDs are:

[(923066800846, List[2016072211,1,923066800846])]

[(923027659472, List[2016072211,1,92328880275]),
 (923027659472, List[2016072211,1,92324440275])]

[(923027659475, List[2016072211,1,92328880275]),
 (923027659475, List[2016072211,1,92324430275]),
 (923027659475, List[2016072211,1,92334340275])]

As shown above, the first RDD has 1 (key, value) pair, the second has 2, and the third has 3.

I want to remove all RDDs that have fewer than 2 key-value pairs. The expected result is:

[(923027659472, List[2016072211,1,92328880275]),
 (923027659472, List[2016072211,1,92324440275])]

[(923027659475, List[2016072211,1,92328880275]),
 (923027659475, List[2016072211,1,92324430275]),
 (923027659475, List[2016072211,1,92334340275])]

I have tried the following:

val reducedListOfCalls = listOfMappedCalls.filter(f => f._1.size >1)

but it still returns the original list. The filter does not seem to make any difference.

Is it possible to count the number of keys in a mapped RDD, and then filter based on the count of keys?

sparkDabbler
  • In your example you have shown List which contains same elements for identical keys. Have you tried reducebykey? – Amit Kumar Sep 03 '16 at 16:59
  • The keys are the same, but the values are different as you can see. I need all values, when the number of keys > 1, reduceByKey did not work for this – sparkDabbler Sep 03 '16 at 17:40
  • Are these all printings of the same RDD? It doesn't look like your `List[Row]` is the one holding these multiple tuples, it looks like the RDD simply has a different amount of tuples inside. – Yuval Itzchakov Sep 03 '16 at 17:57
  • 1
    Why not just use `count`, i.e. `listOfMappedCalls.filter(_.count >= 2)`? – Alfredo Gimenez Sep 03 '16 at 17:59

1 Answer

You can use aggregateByKey in Spark to count the number of values for each key.

Build a Tuple2(count, List[List[Row]]) per key in your combine functions, then filter on the count. The same can be achieved with reduceByKey.

Read this post comparing these two functions.
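
As a rough sketch of that idea (not necessarily the exact code the answer has in mind), the count and the values can be accumulated together with aggregateByKey, filtered on the count, and then flattened back into pairs. The object name `KeepRepeatedKeys`, the helper `keepRepeatedKeys`, and the `minCount` parameter are made up for illustration; the value type is left generic, so it also covers `List[Row]`.

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object KeepRepeatedKeys {
  // Hypothetical helper: keeps only the pairs whose key occurs
  // at least `minCount` times in the input RDD.
  def keepRepeatedKeys[V: ClassTag](
      calls: RDD[(String, V)],
      minCount: Int = 2): RDD[(String, V)] =
    calls
      .aggregateByKey((0, List.empty[V]))(
        // seqOp: add one value to the accumulator within a partition
        (acc, v) => (acc._1 + 1, v :: acc._2),
        // combOp: merge accumulators coming from different partitions
        (a, b) => (a._1 + b._1, a._2 ::: b._2)
      )
      // drop keys that occur fewer than minCount times
      .filter { case (_, (count, _)) => count >= minCount }
      // expand back to one (key, value) pair per original record
      .flatMap { case (key, (_, values)) => values.map(v => (key, v)) }
}
```

Assuming `listOfMappedCalls` is the `RDD[(String, List[Row])]` from the question, it could be called as `KeepRepeatedKeys.keepRepeatedKeys(listOfMappedCalls)`. Note that this collects all values of a key into one list, so it may not suit keys with very many values, and the original ordering of pairs is not preserved.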

Amit Kumar