
I'm looking for an efficient way of applying a map function to each pair of elements in a DataFrame, e.g.

records = spark.createDataFrame(
    [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')],
    ['id', 'val'])
records.show()

+---+---+
| id|val|
+---+---+
|  1|  a|
|  2|  b|
|  3|  c|
|  4|  d|
+---+---+

I want to take values a, b, c, d and compare each of them with all the rest:

a -> b
a -> c
a -> d
b -> c
b -> d
c -> d

By comparison I mean a custom function that takes those 2 values and calculates some similarity index between them. Could you suggest an efficient way to perform this calculation, assuming the input DataFrame could contain tens of millions of elements?
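In other words, something like this self-join to enumerate each unordered pair once (just a sketch to show what I mean; the actual comparison would be my custom function):

from pyspark.sql import functions as F

# Enumerate each unordered pair exactly once by requiring l.id < r.id.
pairs = (records.alias("l")
    .join(records.alias("r"), F.col("l.id") < F.col("r.id"))
    .select(F.col("l.val").alias("left"), F.col("r.val").alias("right")))
pairs.show()  # yields (a,b), (a,c), (a,d), (b,c), (b,d), (c,d)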

Spark version 2.4.6 (AWS emr-5.31.0), using an EMR notebook with PySpark.


2 Answers


Collect the val column values into a lookup column, then compare each value from the lookup array with the val column.

Check the code below.

from pyspark.sql import functions as F

(records
    .select(
        F.collect_list(F.struct(F.col("id"), F.col("val"))).alias("data"),
        F.collect_list(F.col("val")).alias("lookup"))
    .withColumn("data", F.explode(F.col("data")))
    .select("data.*", F.expr("filter(lookup, v -> v != data.val)").alias("lookup"))
    # .withColumn("compare", F.expr("transform(lookup, v -> ...)"))  # add your comparison logic here
    .show())

+---+---+---------+
| id|val|   lookup|
+---+---+---------+
|  1|  a|[b, c, d]|
|  2|  b|[a, c, d]|
|  3|  c|[a, b, d]|
|  4|  d|[a, b, c]|
+---+---+---------+
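
To fill in the comparison, the commented line can apply any expression to each lookup element. For example, using the built-in levenshtein purely as a stand-in for your similarity function (assuming df holds the id/val/lookup frame built above):

from pyspark.sql import functions as F

# df is assumed to be the id/val/lookup DataFrame produced above.
# levenshtein is only a placeholder for the custom similarity.
df.withColumn("compare", F.expr("transform(lookup, v -> levenshtein(val, v))")).show()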

This is a cross join operation followed by a collect_list aggregation. If you want a's match list to contain only [b, c, d], you should apply that filter before doing the collect_list.

from pyspark.sql import functions as F

(records.alias("lhs")
    .crossJoin(records.alias("rhs"))    # self cross join
    .filter("lhs.val != rhs.val")       # drop self-matches before aggregating
    .groupBy("lhs.id", "lhs.val")
    .agg(F.collect_list("rhs.val").alias("lookup"))
    .show())
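
If you need a similarity score per pair rather than just the lookup list, you can compute it on the joined rows before aggregating. A sketch, again with the built-in levenshtein standing in for the custom function:

from pyspark.sql import functions as F

(records.alias("lhs")
    .crossJoin(records.alias("rhs"))
    .filter("lhs.val != rhs.val")
    .withColumn("sim", F.expr("levenshtein(lhs.val, rhs.val)"))  # placeholder similarity
    .groupBy("lhs.id", "lhs.val")
    .agg(F.collect_list(F.struct("rhs.val", "sim")).alias("lookup"))
    .show(truncate=False))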