I am using Scala to program Spark on a standalone machine (a PC running Windows 10). I am a newbie with no experience in Scala or Spark, so I would be very thankful for any help.

Problem:

I have a HashMap, hMap1, whose values are HashSets of Integer entries (HashMap<Integer, HashSet<Integer>>). I then store its values (i.e., many HashSet values) in an RDD. The code is below:

val rdd1 = sc.parallelize(Seq(hMap1.values()))

I have another HashMap, hMap2, of the same type, i.e., HashMap<Integer, HashSet<Integer>>. Its values are also stored in an RDD:

val rdd2 = sc.parallelize(Seq(hMap2.values()))

How can I intersect the values of hMap1 and hMap2?

For example:

Input:

rdd1 = [2, 3], [1, 109], [88, 17]

rdd2 = [2, 3], [1, 109], [5, 45]

Output:

[2, 3], [1, 109]

icarumbas
Kifayat

1 Answer

Problem statement

My understanding of your question is the following:

Given two RDDs of type RDD[Set[Integer]], how can I produce an RDD of their common records?

Sample data

Two RDDs generated by

val rdd1 = sc.parallelize(Seq(Set(2, 3), Set(1, 109), Set(88, 17)))
val rdd2 = sc.parallelize(Seq(Set(2, 3), Set(1, 109), Set(5, 45)))
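
If the data starts out in a java.util.HashMap, as in the question, one way to reach this shape is to copy the values into immutable Scala sets before handing them to Spark. The snippet below is only a sketch to paste into a Scala REPL or spark-shell: the helper jset and the name sets1 are made up for illustration, hMap1 is a stand-in for the question's map, and the import assumes Scala 2.11 or 2.12 (on Scala 2.13 and later the equivalent is scala.jdk.CollectionConverters). Copying matters because the view returned by HashMap.values() is not itself Serializable, which is possibly related to serialization errors when the raw view is passed to Spark.

```scala
import java.util.{HashMap => JHashMap, HashSet => JHashSet}
import scala.collection.JavaConverters._

// Hypothetical helper: build a java.util.HashSet[Integer] from Scala Ints.
def jset(xs: Int*): JHashSet[Integer] = {
  val s = new JHashSet[Integer]()
  xs.foreach(x => s.add(x)) // Int is boxed to Integer implicitly
  s
}

// Stand-in for the question's hMap1 (keys are arbitrary here).
val hMap1 = new JHashMap[Integer, JHashSet[Integer]]()
hMap1.put(1, jset(2, 3))
hMap1.put(2, jset(1, 109))
hMap1.put(3, jset(88, 17))

// Copy every Java HashSet into an immutable Scala Set[Int], and the
// whole collection into a List, so nothing refers back to the Java view.
val sets1: List[Set[Int]] =
  hMap1.values().asScala.map(_.asScala.map(_.intValue).toSet).toList

// In spark-shell, where sc is predefined, this would then become:
// val rdd1 = sc.parallelize(sets1)
```

Note that this gives one RDD record per set, whereas sc.parallelize(Seq(hMap1.values())) would wrap the whole collection in a single record.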

Possible solution

If that understanding is correct, you can use rdd1.intersection(rdd2). This is what I tried in a spark-shell with Spark 2.2.0:

rdd1.intersection(rdd2).collect

which yielded the output:

Array(Set(2, 3), Set(1, 109))

This works because Set[Integer] has well-defined equality (equals and hashCode), which Spark uses to match elements. Note that this does not generalise to an arbitrary Set[MyObject] unless MyObject defines its own equality contract, i.e. consistent equals and hashCode methods.
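
To illustrate the equality point, here is a small sketch that uses plain Scala collections as a stand-in for the RDDs, on the assumption that RDD.intersection relies on the same equals/hashCode pair when it groups matching elements. PlainId and CaseId are hypothetical classes invented for this example.

```scala
// Hypothetical element types, for illustration only.
class PlainId(val id: Int)  // inherits reference equality from AnyRef
case class CaseId(id: Int)  // equals and hashCode are generated from the field

// Two distinct PlainId instances with the same id are never equal,
// so the sets that contain them have nothing in common.
val plainCommon = Set(Set(new PlainId(1))).intersect(Set(Set(new PlainId(1))))

// Case class instances with the same field value are equal,
// so the enclosing sets match.
val caseCommon = Set(Set(CaseId(1))).intersect(Set(Set(CaseId(1))))
```

Here plainCommon comes out empty while caseCommon keeps the shared set, so for a custom element type a case class (or explicit equals and hashCode overrides) is the simplest way to get the behaviour shown above for Set[Integer].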

Alexandre Dupriez
  • Thanks, Alex, for your timely response; it helped a lot. Actually, I have two classes: one in Scala and the other in Java. The source data is in a HashMap in the Java class, where the key is an integer and the corresponding value is a HashSet of integers. When I retrieved the values in Scala, I was getting an error. The main error turned out to be that the HashMap values are not 'serializable', so I could not obtain the data as desired. As a solution, I stored the HashMap values of the Java class in a public static variable and then accessed that variable directly from the Scala class. – Kifayat Nov 14 '17 at 01:50
  • Thanks for the update. I am not sure I understand why Java's `HashSet` of `Integer` could not be serialised, though. Feel free to share the exception if you have time; I'm just curious. – Alexandre Dupriez Nov 14 '17 at 09:33
  • Hi Alex, I was very busy so could not reply in time. I will try to send you the error by next week. Anyway, I have another question; if you could answer it, that would be a great help too. My question is https://stackoverflow.com/questions/47324904/generating-cartesian-product-for-an-rdd-whose-elements-are-sets-in-scala – Kifayat Nov 16 '17 at 08:31