I want to join two sets by applying broadcast variable. I am trying to implement the first suggestion from Spark: what's the best strategy for joining a 2-tuple-key RDD with single-key RDD?
val emp_newBC = sc.broadcast(emp_new.collectAsMap())
val joined = emp.mapPartitions({ iter =>
val m = emp_newBC.value
for {
((t, w)) <- iter
if m.contains(t)
} yield ((w + '-' + m.get(t).get),1)
}, preservesPartitioning = true)
However as mentioned here: broadcast variable fails to take all data I need to use collect() rather than collectAsMAp(). I tried to adjust my code as below:
val emp_newBC = sc.broadcast(emp_new.collect())
val joined = emp.mapPartitions({ iter =>
val m = emp_newBC.value
for {
((t, w)) <- iter
if m.contains(t)
amk = m.indexOf(t)
} yield ((w + '-' + emp_newBC.value(amk)),1) //yield ((t, w), (m.get(t).get)) //((w + '-' + m.get(t).get),1)
}, preservesPartitioning = true)
But it seems m.contains(t) does not respond. How can I remedy this?
Thanks in advance.