
I have an RDD

 JavaPairRDD<String,Customer> RDD1

which has one record:

cust_id  first_name  last_name
1        "rahul"     "koshaley"

and a second RDD

 JavaPairRDD<String,Customer> RDD2

which also has one record:

cust_id  first_name  last_name
1        "rahul"     ""

When I take the union

 JavaPairRDD<String,Customer> unionRDD = RDD1.union(RDD2);

the result has 2 records:

1)  1 , "rahul" , "koshaley"
2)  1 , "rahul" , ""

Now, when I call distinct on unionRDD, i.e.

 JavaPairRDD<String,Customer> distinct = unionRDD.distinct();

which record will the resulting RDD contain:

 1 , "rahul" , "koshaley"

or

 1 , "rahul" , ""

I want the output RDD to contain the record that has all the values, i.e. 1 , "rahul" , "koshaley".

EDIT: This question is NOT a duplicate. I need to know which one of the duplicate records Spark will pick after the distinct operation.
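
For reference, distinct compares whole elements, and in a pair RDD each element is the full (key, value) pair, so the outcome depends on how Customer implements equals and hashCode. Below is a minimal plain-Java sketch of that comparison, with no Spark dependency. The Customer class and its fields, and the names DistinctSketch, pair, and distinctPairs, are assumptions made for illustration; a LinkedHashSet stands in for distinct.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

class DistinctSketch {
    // Hypothetical stand-in for the Customer class in the question.
    static final class Customer {
        final String custId;
        final String firstName;
        final String lastName;

        Customer(String custId, String firstName, String lastName) {
            this.custId = custId;
            this.firstName = firstName;
            this.lastName = lastName;
        }

        // Value equality over all three fields.
        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Customer)) return false;
            Customer c = (Customer) o;
            return custId.equals(c.custId)
                && firstName.equals(c.firstName)
                && lastName.equals(c.lastName);
        }

        @Override
        public int hashCode() {
            return Objects.hash(custId, firstName, lastName);
        }
    }

    // Builds one (key, value) pair, keyed by cust_id.
    static Map.Entry<String, Customer> pair(String id, String first, String last) {
        return new SimpleEntry<>(id, new Customer(id, first, last));
    }

    // Deduplicates whole pairs: two pairs collapse only if key AND value are equal.
    static Set<Map.Entry<String, Customer>> distinctPairs(
            List<Map.Entry<String, Customer>> pairs) {
        return new LinkedHashSet<>(pairs);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Customer>> union = Arrays.asList(
            pair("1", "rahul", "koshaley"),
            pair("1", "rahul", ""));
        // Same key, different values: nothing is dropped.
        System.out.println(distinctPairs(union).size());
    }
}
```

Under these assumptions both records survive, because they share a key but differ in last_name. Collapsing records by key alone would need a per-key reduction rather than distinct.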

  • distinct looks at the whole record, not a single field, so there is no duplicate record in your `unionRDD`. – avr Aug 02 '16 at 09:36
  • It's a pair RDD, so I will have to apply reduceByKey on my unionRDD, right? – Rahul Koshaley Aug 02 '16 at 09:39
  • I can't see any pairs in the examples you provided. However, it depends on the input data. How do you handle the case where there are 3 records and 2 of them have all the data? – avr Aug 02 '16 at 09:48
  • I just edited my question and included a pair RDD in my example. Now, since it's a pair, how will distinct work? Also, how do I handle the scenario that you described? – Rahul Koshaley Aug 02 '16 at 09:50
  • It all depends on your use case. I don't understand what you are trying to achieve. Could you explain with a concrete example so that the SO community can help you quickly? – avr Aug 02 '16 at 09:55
  • OK, I'll edit the question in a while. – Rahul Koshaley Aug 02 '16 at 10:20
  • Hi, can you please help me with this? http://stackoverflow.com/questions/38783654/how-to-combine-3-pair-rdds – Rahul Koshaley Aug 05 '16 at 07:50
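
The reduceByKey idea raised in the comments can be sketched in plain Java, with no Spark dependency. This is only an illustration under assumptions: the Customer class and its fields, the "more non-empty fields wins" rule, and the names MergeSketch, filledFields, moreComplete, and reduceByKey (a local stand-in, not Spark's method) are all hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MergeSketch {
    // Hypothetical stand-in for the Customer class in the question.
    static final class Customer {
        final String custId;
        final String firstName;
        final String lastName;

        Customer(String custId, String firstName, String lastName) {
            this.custId = custId;
            this.firstName = firstName;
            this.lastName = lastName;
        }
    }

    // Counts the name fields that are neither null nor empty.
    static int filledFields(Customer c) {
        int n = 0;
        if (c.firstName != null && !c.firstName.isEmpty()) n++;
        if (c.lastName != null && !c.lastName.isEmpty()) n++;
        return n;
    }

    // Assumed merge rule: keep the record with more filled fields.
    // On a tie the first argument is kept.
    static Customer moreComplete(Customer a, Customer b) {
        return filledFields(b) > filledFields(a) ? b : a;
    }

    // Local stand-in for a per-key reduction, keyed by cust_id.
    static Map<String, Customer> reduceByKey(List<Customer> records) {
        Map<String, Customer> out = new LinkedHashMap<>();
        for (Customer c : records) {
            out.merge(c.custId, c, MergeSketch::moreComplete);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Customer> merged = reduceByKey(Arrays.asList(
            new Customer("1", "rahul", ""),
            new Customer("1", "rahul", "koshaley")));
        // One record per key remains: the one with a last name.
        System.out.println(merged.get("1").lastName);
    }
}
```

In Spark, the same two-argument function could presumably be passed to reduceByKey on the pair RDD, for example unionRDD.reduceByKey((a, b) -> MergeSketch.moreComplete(a, b)). For the case of 3 records where 2 are equally complete but differ, this rule keeps whichever is seen first, and on a cluster that order is not guaranteed, so an explicit tie-breaker would be needed.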

0 Answers