
Is it possible to join two RDDs in Spark using a custom function? I have two big RDDs with a string as the key. I want to join them, not with the classic join but with a custom function like:

def my_func(a, b):
    # Lev here is the python-Levenshtein package, e.g. import Levenshtein as Lev
    return Lev.distance(a, b) < 2

result_rdd = rdd1.join(rdd2, my_func)

If it's not possible, is there any alternative that would still take advantage of the Spark cluster? I wrote something like this, but PySpark is not able to distribute the work across my small cluster.

def custom_join(rdd1, rdd2, my_func):
    # Collect both RDDs to the driver, sorted by key: this is what prevents
    # the work from being distributed across the cluster.
    a = rdd1.sortByKey().collect()
    b = rdd2.sortByKey().collect()
    i = 0
    j = 0
    res = []
    # Merge-style scan over the two sorted lists, pairing keys accepted by my_func
    while i < len(a) and j < len(b):
        if my_func(a[i][0], b[j][0]):
            res += [((a[i][0], b[j][0]), (a[i][1], b[j][1]))]
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            i += 1
        else:
            j += 1

    return sc.parallelize(res)

Thanks in advance (and sorry for my English, I'm Italian).

Luca Di Liello

1 Answer


You can use cartesian and then filter based on your condition.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2), ("b", 3)])

def customFunc(pair):
    # pair is ((key1, value1), (key2, value2)); you may use any condition here
    return pair[0][0] == pair[1][0]

print(x.join(y).collect())  # normal join

# replicating the join with cartesian + filter
print(x.cartesian(y).filter(customFunc).flatMap(lambda p: p).groupByKey().mapValues(tuple).collect())

Output:

[('b', (4, 3)), ('a', (1, 2))]
[('a', (1, 2)), ('b', (4, 3))]
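
For the original Levenshtein use case, the same pattern applies: filter the cartesian product on the distance between the two keys. A minimal sketch, assuming the python-Levenshtein package (imported as Lev, as in the question) and the rdd1/rdd2 from the question; the helper name lev_match is just illustrative:

import Levenshtein as Lev  # assumes the python-Levenshtein package is installed

def lev_match(pair):
    # illustrative helper; pair is ((key1, value1), (key2, value2)) from the cartesian product
    return Lev.distance(pair[0][0], pair[1][0]) < 2

# keep both keys in the result, since they may differ by one edit
result_rdd = (rdd1.cartesian(rdd2)
                  .filter(lev_match)
                  .map(lambda p: ((p[0][0], p[1][0]), (p[0][1], p[1][1]))))

Note that cartesian materialises every pair from the two RDDs, so it can be expensive for very large inputs, but the work is still distributed across the cluster.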
Himaprasoon