I have a simple Python list of strings, myTexts, as shown below. I want to apply a nested loop over the list and return only those pairs of strings whose similarity meets a particular threshold value. Here is an example of my existing Python code:
myTexts = ['abc', 'cde', 'ccc', 'efg', 'eee', 'kkk']
someThreshold = 0.5
resultantPairs = []

def LCS(string1, string2):
    # placeholder: returns a numeric similarity value x (within the range [0, 1])
    # computed from the passed strings string1 and string2
    return x

for i in range(len(myTexts)):
    for j in range(len(myTexts)):
        similarityValue = LCS(myTexts[i], myTexts[j])
        if similarityValue >= someThreshold:
            resultantPairs.append((myTexts[i], myTexts[j], similarityValue))
        else:  # keeping a flag (-1)
            resultantPairs.append((myTexts[i], myTexts[j], -1))
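My real LCS function is more involved, but to make the example self-contained, a minimal stand-in along these lines (longest-common-subsequence length normalized by the length of the longer string, so the value lands in [0, 1]; the exact formula is not important here) would be:

def LCS(string1, string2):
    # stand-in similarity: LCS length divided by the longer string's length
    m, n = len(string1), len(string2)
    if m == 0 or n == 0:
        return 0.0
    # classic O(m*n) dynamic-programming table for the LCS length
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if string1[i - 1] == string2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n] / max(m, n)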
So I need to apply a nested loop of O(n^2) complexity over the same list (myTexts). However, I cannot find an efficient way to implement the same logic with a PySpark RDD or DataFrame, since they do not support direct looping the way the sequential code above does.
Searching the web, I found one possible way to emulate the nested loop on an RDD by taking the Cartesian product. However, I found the cartesian operation on an RDD or DataFrame to be very slow. Here is my current PySpark code using the Cartesian product on an RDD:
myTexts = sc.parallelize(['abc', 'cde', 'ccc', 'efg', 'eee', 'kkk'])

# the following operation is very computationally expensive
cartesianTexts = myTexts.cartesian(myTexts)

def myFunction(x):
    similarityValue = LCS(x[0], x[1])
    if similarityValue >= someThreshold:
        return (x[0], x[1], similarityValue)
    else:  # keeping a flag (-1)
        return (x[0], x[1], -1)

resultantPairs = cartesianTexts.map(myFunction)
The above implementation takes too much time even for relatively small datasets. It would be really great if you could suggest some ways to speed up the PySpark code.
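For reference, I assume the DataFrame route would look roughly like the sketch below (crossJoin plus a Python UDF; the spark session, the column names text/text1/text2, and the lcs_or_flag helper are just my placeholders), and I expect it to suffer from the same quadratic blow-up, since the cross join still materializes all n^2 pairs:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
someThreshold = 0.5

df = spark.createDataFrame([(t,) for t in ['abc', 'cde', 'ccc', 'efg', 'eee', 'kkk']],
                           ['text'])

# wrap the LCS similarity in a UDF; return -1.0 as the flag when below the threshold
@F.udf(returnType=DoubleType())
def lcs_or_flag(s1, s2):
    similarityValue = LCS(s1, s2)
    return float(similarityValue) if similarityValue >= someThreshold else -1.0

# crossJoin builds every (text1, text2) pair, i.e. the same O(n^2) pairing as cartesian()
pairs = (df.alias('a')
           .crossJoin(df.alias('b'))
           .select(F.col('a.text').alias('text1'),
                   F.col('b.text').alias('text2')))

resultantPairsDF = pairs.withColumn('similarity', lcs_or_flag('text1', 'text2'))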