how to do intersection of list columns with pyspark dataframe columns?

Question

I have a below pyspark dataframe and i need to create new column (new_col) which is common items in column X and Y excluding items in Z.

df

id X             Y                    Z            new_col
1 [12,23,1,24]  [13,412,12,23,24]     [12]         [23,24]
2 [1,2,3]       [2,4,5,6]             []           [2]

score 8 · Accepted Answer · answered Jul 23 '19 at 15:51

If your schema is the following:

df.printSchema()
#root
# |-- id: long (nullable = true)
# |-- X: array (nullable = true)
# |    |-- element: long (containsNull = true)
# |-- Y: array (nullable = true)
# |    |-- element: long (containsNull = true)
# |-- Z: array (nullable = true)
# |    |-- element: long (containsNull = true)

and your pyspark version 2.4+ you can use array_intersect and array_except:

from pyspark.sql.functions import array_except, array_intersect
df=df.withColumn("new_col", array_except(array_intersect("X", "Y"), "Z"))
df.show()
#+---+---------------+---------------------+----+--------+
#|id |X              |Y                    |Z   |new_col |
#+---+---------------+---------------------+----+--------+
#|1  |[12, 23, 1, 24]|[13, 412, 12, 23, 24]|[12]|[23, 24]|
#|2  |[1, 2, 3]      |[2, 4, 5, 6]         |[]  |[2]     |
#+---+---------------+---------------------+----+--------+

thanks pault, is there any way to include X or Y if one of them is null as well and then subtract Z? — user3222101, Jul 24 '19 at 13:55
Use [`size`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.size) to check if the array is empty along with [`when`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.when) to do the [pyspark equivalent of if-then-else](https://stackoverflow.com/questions/39048229/spark-equivalent-of-if-then-else). — pault, Jul 24 '19 at 14:16

Sequinex · Answer 2 · 2019-07-23T13:25:44.350

0

You can use withcolumn + udf

import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, ArrayType

def intersection_function(list1, list2):
    intersection_list = [value for value in list1 if value in list2]
    return intersection_list

udf_intersection = F.udf(intersection_function, ArrayType(IntegerType())

newdf = df.withColumn("new_col", udf_intersection(df["ListColumn1"], df["ListColumn2"]))

edited Jul 23 '19 at 13:25

answered Jul 23 '19 at 10:56

Sequinex

641
1
9
17

Yes, look at the answer from @pault – Sequinex Jul 24 '19 at 08:11

how to do intersection of list columns with pyspark dataframe columns?

2 Answers2