
I'm new to Spark, and my problem is the following. I have a JavaPairRDD already loaded with data, and now I need to apply a map transformation to it so that I get back a new RDD whose values depend on some inner transformations inside the map function, as follows (pseudo-code):

JavaPairRDD<Long, Long> originalRDD = ...; // the one I load from the dataset
JavaPairRDD<Long, Long> anotherrdd = ...;  // the source of tuples
JavaPairRDD<Tuple2<Long, Long>, Long> result = anotherrdd
                .mapToPair(tuple -> {
                    JavaRDD<Long> aux1;
                    JavaRDD<Long> aux2;
                    aux1 = originalRDD.filter(t -> t._1.equals(tuple._1)).values();
                    aux2 = originalRDD.filter(t -> t._2.equals(tuple._2)).values();
                    JavaRDD<Long> auxfinal = aux1.intersection(aux2);
                    // some other code here that processes auxfinal and returns a
                    // new tuple to result (rdd)
                });

If I code it this way, does the executor create new jobs itself (for the filters and intersections) and launch them, or will the Spark context be aware of this and create those jobs? I've been reading the official documentation and it doesn't clarify what happens in this case. Thanks in advance!

1 Answer


Actually, the only component that can create jobs and tasks is the driver, through the SparkContext. That means you cannot declare a new RDD, or even reference an existing one, inside a transformation of another RDD.

Moreover, what you are looking for is the join operation. It works like a join in a relational database: there are two tables with a common column, and you find matching tuples based on that column. To do so, you need two RDDs that both have a key for every element.

join(otherDataset, [numPartitions]) When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

For more information you can also look at Join two ordinary RDDs with/without Spark SQL.
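
As a minimal, self-contained illustration (the local context and the sample data below are assumptions, not taken from the question), joining a (K, V) RDD with a (K, W) RDD produces (K, (V, W)) pairs for every key present in both:

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

JavaSparkContext sc = new JavaSparkContext("local[*]", "join-example");

// (K, V) dataset
JavaPairRDD<Long, Long> left = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(1L, 10L), new Tuple2<>(2L, 20L)));

// (K, W) dataset
JavaPairRDD<Long, String> right = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(1L, "a"), new Tuple2<>(3L, "c")));

// inner join on the key K: yields (K, (V, W)) only for keys in both RDDs
JavaPairRDD<Long, Tuple2<Long, String>> joined = left.join(right);

joined.collect().forEach(System.out::println); // prints (1,(10,a))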

  • The problem is that I need, for *each* tuple in "anotherRDD", to evaluate the (K, V) separately, filtering by a given criterion, then intersecting the data in common, and then mapping the values with the amount of "common" data onto a new RDD. The join transformation doesn't help with my problem. – Hernan Z Sep 19 '18 at 22:15
  • See the problem as a nested FOR loop (as we'd do in a normal Java program): "for each tuple of this RDD, get the key, filter from originalRDD (using K) all the data and bring those tuples into aux1. Then filter from originalRDD (using V) all the data and bring those tuples into aux2. Get the intersection of both auxes, and return the observed K, V onto result, with some counting number." – Hernan Z Sep 19 '18 at 22:15
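
A rough sketch of how that nested-loop logic could be expressed with joins instead of nested RDD operations (this is only one possible reading of the comments, not code from the question or the answer; names such as byKey, byValue, and withAux1 are illustrative, and it assumes aux2 is meant to hold the keys of the tuples whose value matches V):

import java.util.HashSet;
import java.util.Set;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// originalRDD grouped by its key: (K, all values that share that key)
JavaPairRDD<Long, Iterable<Long>> byKey = originalRDD.groupByKey();

// originalRDD re-keyed by its value and grouped: (V, all keys that share that value)
JavaPairRDD<Long, Iterable<Long>> byValue = originalRDD
        .mapToPair(t -> new Tuple2<>(t._2, t._1))
        .groupByKey();

// attach aux1 to each (K, V) of anotherrdd: (K, (V, aux1))
JavaPairRDD<Long, Tuple2<Long, Iterable<Long>>> withAux1 = anotherrdd.join(byKey);

// re-key by V so aux2 can be attached: (V, (K, aux1))
JavaPairRDD<Long, Tuple2<Long, Iterable<Long>>> keyedByValue = withAux1
        .mapToPair(e -> new Tuple2<>(e._2._1,
                new Tuple2<Long, Iterable<Long>>(e._1, e._2._2)));

// attach aux2, intersect the two groups, and count the common elements
JavaPairRDD<Tuple2<Long, Long>, Long> result = keyedByValue
        .join(byValue) // (V, ((K, aux1), aux2))
        .mapToPair(e -> {
            Long k = e._2._1._1;
            Long v = e._1;
            Set<Long> common = new HashSet<>();
            e._2._1._2.forEach(common::add); // aux1: values sharing key K
            Set<Long> aux2 = new HashSet<>();
            e._2._2.forEach(aux2::add);      // aux2: keys sharing value V
            common.retainAll(aux2);          // aux1 ∩ aux2
            Tuple2<Long, Long> outKey = new Tuple2<>(k, v);
            return new Tuple2<>(outKey, (long) common.size());
        });

Collecting aux1 and aux2 into in-memory sets per output record keeps the sketch simple; for very large groups, a cogroup-based variant of the same idea would scale better.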