I have a PySpark DataFrame with a column of strings. I want to find similar strings and mark them with a flag. I am using the ratio function from the python-Levenshtein module and want to mark strings whose ratio is greater than 0.90 as "similar". Here is an example of my DataFrame:
# Assumes an active SparkSession named `spark`.
sentenceDataFrame = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic regression models are neat"),
    (3, "Logistic regression model is neat"),
    (4, "Logistics regression model are neat")
], ["id", "sentence"])
The desired output is:
+---+-----------------------------------+------------+
|id |sentence                           |similar_flag|
+---+-----------------------------------+------------+
|0  |Hi I heard about Spark             |            |
|1  |I wish Java could use case classes |            |
|2  |Logistic regression models are neat|2_0         |
|3  |Logistic regression model is neat  |2_1         |
|4  |Logistics regression model are neat|2_2         |
+---+-----------------------------------+------------+
Here "2_1" means that "2" is the "id" of the reference string (the first unique string used for matching) and "1" marks the first string that matches it. I want to avoid for-loops completely. For smaller data I used a for-loop in plain Python to get this result, and I want the same result in PySpark, so I do not want to use any module other than python-Levenshtein.

I have come across this approach, but it requires me to give up the python-Levenshtein module. Also, my DataFrame is likely to be huge (and is expected to grow every day), so this approach might cause memory errors. Is there a better way to achieve the desired result?
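
For reference, the loop-based flagging described above might look roughly like the following plain-Python sketch. It is only an illustration of the intended semantics, not the actual code used: the names `flag_similar` and `stand_in_ratio` are hypothetical, and `difflib.SequenceMatcher` is used as a stand-in for `Levenshtein.ratio` so that the sketch runs without extra installs. The two functions can give different scores in general, so the real `ratio` should be passed in as `similarity` when the module is available.

```python
from difflib import SequenceMatcher


def stand_in_ratio(a, b):
    # Stand-in for Levenshtein.ratio (assumption: used only so the sketch
    # runs with the standard library; scores may differ in general).
    return SequenceMatcher(None, a, b).ratio()


def flag_similar(rows, similarity=stand_in_ratio, threshold=0.90):
    """Return {id: flag} for a list of (id, sentence) pairs.

    The first unflagged sentence of a similar group is the reference and
    gets "<ref_id>_0"; its k-th match gets "<ref_id>_<k>". Sentences with
    no match keep an empty flag.
    """
    flags = {row_id: "" for row_id, _ in rows}
    for i, (ref_id, ref_text) in enumerate(rows):
        if flags[ref_id]:
            continue  # already matched to an earlier reference
        matches = 0
        for other_id, other_text in rows[i + 1:]:
            if flags[other_id]:
                continue  # already belongs to another group
            if similarity(ref_text, other_text) > threshold:
                matches += 1
                flags[other_id] = f"{ref_id}_{matches}"
        if matches:
            flags[ref_id] = f"{ref_id}_0"
    return flags


rows = [
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic regression models are neat"),
    (3, "Logistic regression model is neat"),
    (4, "Logistics regression model are neat"),
]
flags = flag_similar(rows)
for row_id, sentence in rows:
    print(row_id, sentence, flags[row_id])
```

The nested loop compares every reference against all later rows, which is quadratic in the number of sentences. That is why it works for small data but does not carry over directly to a large, growing DataFrame.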