I'm looking for an efficient way of applying a map function to each pair of elements in a DataFrame, e.g.:
records = spark.createDataFrame(
    [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')],
    ['id', 'val'])
records.show()
+---+---+
| id|val|
+---+---+
| 1| a|
| 2| b|
| 3| c|
| 4| d|
+---+---+
I want to take the values a, b, c, d and compare each of them with all the rest:
a -> b
a -> c
a -> d
b -> c
b -> d
c -> d
By comparison I mean a custom function that takes those two values and calculates some similarity index between them. A naive version of what I have in mind is sketched below. Could you suggest an efficient way to perform this calculation, assuming the input DataFrame could contain tens of millions of elements?
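For illustration, here is a rough self-join sketch of what I mean (similarity is a hypothetical placeholder for my custom function, not the real logic):

from pyspark.sql import functions as F

# hypothetical placeholder for the real similarity index
@F.udf('double')
def similarity(x, y):
    return float(x == y)

a = records.alias('a')
b = records.alias('b')

# self-join on id < id so every unordered pair appears exactly once
pairs = (a.join(b, F.col('a.id') < F.col('b.id'))
          .select(F.col('a.val').alias('left'),
                  F.col('b.val').alias('right'),
                  similarity(F.col('a.val'), F.col('b.val')).alias('sim')))
pairs.show()

This works on the toy example, but the join produces n*(n-1)/2 rows, which is exactly what worries me at tens of millions of input rows.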
Spark version 2.4.6 (AWS emr-5.31.0), using an EMR notebook with PySpark.