I have a large RDD (rdd_1) and a filtered subset (rdd_2) of it. I want to join rdd_1 and rdd_2 on a different field.
Let's say the records have the format {'first_name': <>, 'last_name': <>}. We want to find all records that share a last name with anyone whose first name is 'jack'.
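To make that concrete, a made-up example (names invented):

records = [
    {'first_name': 'jack', 'last_name': 'smith'},
    {'first_name': 'jill', 'last_name': 'smith'},
    {'first_name': 'bob', 'last_name': 'jones'},
]
# expected result: both 'smith' records, since a jack has that last name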
import json
import operator

# textFile returns raw lines; parse each one (assuming JSON-encoded records)
names = sc.textFile(RAW_DATA).map(json.loads)
jack = names.filter(lambda v: v['first_name'] == 'jack')
Option 1
jack_last_names = jack.map(operator.itemgetter('last_name')).distinct().collect()
last_names_bc = sc.broadcast(set(jack_last_names))
final = names.filter(lambda v: v['last_name'] in last_names_bc.value)
Currently, I broadcast rdd_2 and filter rdd_1 by it. The trouble is that in order to broadcast rdd_2, I first have to collect() it on the driver, which causes the driver to run out of memory.
Is there a way to broadcast an RDD without first collect()ing it on the driver?
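One variant I sketched (untested): RDD.toLocalIterator() pulls one partition at a time, so the peak memory while building the set is bounded by the largest partition rather than the full collect()ed list. The finished set still has to fit on the driver, though:

# untested sketch: stream partitions to the driver one at a time
# instead of materializing everything at once with collect()
jack_last_names = jack.map(operator.itemgetter('last_name')).distinct()
last_names_bc = sc.broadcast(set(jack_last_names.toLocalIterator()))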
Option 2
final = (jack.keyBy(operator.itemgetter('last_name'))
             .join(names.keyBy(operator.itemgetter('last_name'))))
My other option is rdd_1.join(rdd_2), but rdd_1 is way too big to shuffle.
When we run rdd_1.join(rdd_2), do both rdd_1 and rdd_2 get hash-partitioned and shuffled?
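Related sketch (untested): my understanding is that if one side of a join() already has a known partitioner, Spark reuses it and only shuffles the other side. So pre-partitioning names by last name once might avoid re-shuffling the big RDD (names_by_last and NUM_PARTITIONS are names I made up):

# untested sketch: hash-partition the big RDD by the join key once,
# cache it, and let join() reuse the partitioner so only the small
# side moves. NUM_PARTITIONS is a placeholder.
names_by_last = (names.keyBy(operator.itemgetter('last_name'))
                      .partitionBy(NUM_PARTITIONS)
                      .cache())
jack_by_last = jack.keyBy(operator.itemgetter('last_name'))
final = names_by_last.join(jack_by_last)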
Thanks!