
I have a simple Python list of strings (myTexts). I want to apply a nested loop over the list and return only those pairs of strings whose similarity meets a particular threshold value. Here is an example of my existing Python code:

myTexts = ['abc', 'cde', 'ccc', 'efg', 'eee', 'kkk']
someThreshold = 0.5
resultantPairs = []

def LCS(string1, string2):
    # returns a numeric similarity value x (within the range [0, 1])
    # based on the passed strings string1 and string2
    return x


for i in range(len(myTexts)):
    for j in range(len(myTexts)):
        similarityValue = LCS(myTexts[i], myTexts[j])
        if similarityValue >= someThreshold:
            resultantPairs.append((myTexts[i], myTexts[j], similarityValue))
        else:  # keeping a flag (-1)
            resultantPairs.append((myTexts[i], myTexts[j], -1))

So I need to apply a nested loop of O(n^2) complexity over the same list (myTexts). However, I cannot find an efficient way to implement the same code on a PySpark RDD or DataFrame, since they do not support direct looping the way the sequential approach above does.

Searching the web, I found one possible way to do the nested looping on an RDD by applying a Cartesian product. However, I found the cartesian operation on an RDD or DataFrame very slow. Here is my current PySpark code using the Cartesian product on an RDD:

myTexts = sc.parallelize(['abc', 'cde', 'ccc', 'efg', 'eee', 'kkk'])
# the following operation is very computationally expensive
cartesianTexts = myTexts.cartesian(myTexts)

def myFunction(x):
    similarityValue = LCS(x[0], x[1])
    if similarityValue >= someThreshold:
        return (x[0], x[1], similarityValue)
    else:  # keeping a flag (-1)
        return (x[0], x[1], -1)

resultantPairs = cartesianTexts.map(myFunction)

The above implementation takes too much time even for a relatively small dataset. It would be really great if you could suggest some ways to speed up the PySpark code.
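
For reference, here is a minimal sketch of the only rewrite direction I could come up with so far. The LCS below is just a stand-in based on difflib, not my real similarity function; the sketch uses a DataFrame crossJoin, keeps each unordered pair only once since LCS(a, b) == LCS(b, a), and filters on the threshold early instead of carrying the -1 flag rows. It still performs O(n^2) comparisons, just with less duplicate work:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
someThreshold = 0.5

def LCS(string1, string2):
    # stand-in only: any function returning a similarity in [0, 1] fits here
    from difflib import SequenceMatcher
    return SequenceMatcher(None, string1, string2).ratio()

similarity_udf = F.udf(LCS, DoubleType())

texts = spark.createDataFrame(
    [(i, t) for i, t in enumerate(['abc', 'cde', 'ccc', 'efg', 'eee', 'kkk'])],
    ['id', 'text'])

left = texts.select(F.col('id').alias('id1'), F.col('text').alias('text1'))
right = texts.select(F.col('id').alias('id2'), F.col('text').alias('text2'))

resultantPairs = (left.crossJoin(right)
                  # keep each unordered pair once, since the measure is symmetric
                  .filter(F.col('id1') < F.col('id2'))
                  .withColumn('similarity', similarity_udf('text1', 'text2'))
                  # drop below-threshold rows instead of carrying a -1 flag
                  .filter(F.col('similarity') >= someThreshold))

resultantPairs.show()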

  • Here is the specific problem; any ideas, please, @roganjosh? – Code01 Aug 02 '18 at 01:47
  • Your requirements stated here necessitate not only O(n^2) computational complexity but also O(n^2) data transfer on the spark cluster. This cannot be made "efficient". If you can come up with a way to do this in something like O(N log(N)) computation/data transfer, you might have a chance. – Bi Rico Aug 02 '18 at 02:09
  • Any ideas for an O(N log(N)) approach, @BiRico? – Code01 Aug 02 '18 at 02:19
  • I'd have to know more about your use case. You might be able to do something like the [approximate similarity join of Spark's LSH](https://spark.apache.org/docs/latest/ml-features.html#approximate-similarity-join) (see the sketch after this comment thread). You could also think about some sort of kd-tree, but I haven't really thought through how that would work. – Bi Rico Aug 02 '18 at 03:40
  • I tried the approach you linked to; however, it is also significantly slow even for a medium-sized dataset (in my case, 2886 rows) @user8371915 – Code01 Aug 02 '18 at 22:01
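
For completeness, here is a rough sketch of the approximate similarity join suggested above. Note that it approximates Jaccard similarity over character 2-grams, which is not the same measure as my LCS, so it only illustrates the idea; the numHashTables value and the 0.5 distance threshold are arbitrary:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType
from pyspark.ml.feature import NGram, HashingTF, MinHashLSH

spark = SparkSession.builder.getOrCreate()

texts = spark.createDataFrame(
    [(i, t) for i, t in enumerate(['abc', 'cde', 'ccc', 'efg', 'eee', 'kkk'])],
    ['id', 'text'])

# represent each string as its character 2-grams
to_chars = F.udf(lambda s: list(s), ArrayType(StringType()))
chars = texts.withColumn('chars', to_chars('text'))
bigrams = NGram(n=2, inputCol='chars', outputCol='ngrams').transform(chars)

# hash the 2-grams into sparse feature vectors for MinHash
features = HashingTF(inputCol='ngrams', outputCol='features',
                     numFeatures=1 << 16).transform(bigrams)

lsh = MinHashLSH(inputCol='features', outputCol='hashes', numHashTables=5)
model = lsh.fit(features)

# self-join; the threshold is on Jaccard *distance*, so 0.5 keeps pairs
# with approximate Jaccard similarity >= 0.5
joined = (model.approxSimilarityJoin(features, features, 0.5,
                                     distCol='jaccardDist')
          .filter(F.col('datasetA.id') < F.col('datasetB.id'))
          .select(F.col('datasetA.text').alias('textA'),
                  F.col('datasetB.text').alias('textB'),
                  'jaccardDist'))

joined.show()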

0 Answers