How to filter RDDs using count of keys in a map

Question

I have the following RDD

val reducedListOfCalls: RDD[(String, List[Row])]

The RDDs are:

[(923066800846, List[2016072211,1,923066800846])]

[(923027659472, List[2016072211,1,92328880275]),
 (923027659472, List[2016072211,1,92324440275])]

[(923027659475, List[2016072211,1,92328880275]),
 (923027659475, List[2016072211,1,92324430275]),
 (923027659475, List[2016072211,1,92334340275])]

As shown above, the first RDD has 1 (key, value) pair, the second has 2, and the third has 3.

I want to remove all RDDs that have fewer than 2 key-value pairs. The expected result RDD is:

[(923027659472, List[2016072211,1,92328880275]),
 (923027659472, List[2016072211,1,92324440275])]

[(923027659475, List[2016072211,1,92328880275]),
 (923027659475, List[2016072211,1,92324430275]),
 (923027659475, List[2016072211,1,92334340275])]

I have tried the following:

val reducedListOfCalls = listOfMappedCalls.filter(f => f._1.size >1)

but it still returns the original list. The filter does not seem to have made any difference.

Is it possible to count the number of keys in a mapped RDD, and then filter based on the count of keys?

In your example you have shown a List that contains the same elements for identical keys. Have you tried `reduceByKey`? — Amit Kumar, Sep 03 '16 at 16:59
The keys are the same, but the values are different, as you can see. I need all values when the number of keys > 1; `reduceByKey` did not work for this. — sparkDabbler, Sep 03 '16 at 17:40
Are these all printings of the same RDD? It doesn't look like your `List[Row]` is the one holding these multiple tuples, it looks like the RDD simply has a different amount of tuples inside. — Yuval Itzchakov, Sep 03 '16 at 17:57
Why not just use `count`, i.e. `listOfMappedCalls.filter(_.count >= 2)`? — Alfredo Gimenez, Sep 03 '16 at 17:59

score 1 · Accepted Answer · edited May 23 '17 at 10:33

You can use aggregateByKey in Spark to count how many times each key occurs.

You should create a Tuple2(count, List[List[Row]]) in your combine function. The same can be achieved by reduceByKey.

Read this post comparing these two functions.
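
A minimal sketch of the aggregateByKey approach described above, assuming the input is a pair RDD in which the same key appears in several tuples. The names `KeyCountFilter`, `keepFrequentKeys`, and `minCount` are hypothetical and used only for illustration; the accumulator is the (count, list of values) tuple the answer suggests.

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical helper, not part of Spark: keeps only the keys that occur
// in at least `minCount` pairs, together with all values seen for that key.
object KeyCountFilter {
  def keepFrequentKeys[K: ClassTag, V: ClassTag](
      pairs: RDD[(K, V)],
      minCount: Int): RDD[(K, List[V])] =
    pairs
      // The accumulator is (number of pairs seen, values collected so far).
      .aggregateByKey((0, List.empty[V]))(
        // seqOp: fold one value into the accumulator within a partition.
        (acc, value) => (acc._1 + 1, value :: acc._2),
        // combOp: merge the accumulators of two partitions.
        (a, b) => (a._1 + b._1, a._2 ::: b._2)
      )
      // Drop every key that occurred fewer than `minCount` times.
      .filter { case (_, (count, _)) => count >= minCount }
      // Discard the count and keep only the collected values.
      .mapValues(_._2)
}
```

With the RDD from the question, a call such as `KeyCountFilter.keepFrequentKeys(listOfMappedCalls, 2)` would yield one entry per key that has at least two pairs, holding all of that key's values. The order of the values inside each list is not guaranteed. The same shape should also work with reduceByKey if each value is first mapped to `(1, List(value))`.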

answered Sep 03 '16 at 21:23

Amit Kumar
