I have a DataFrame df1:

    id    transactions
    1     [1, 3, 3, 3, 2, 5]
    2     [1, 2]

    root
     |-- id: int (nullable = true)
     |-- transactions: array (nullable = false)
     |    |-- element: string (containsNull = true)
and a DataFrame df2:

    items          cost
    [1, 3, 3, 5]   2
    [1, 5]         1

    root
     |-- items: array (nullable = false)
     |    |-- element: string (containsNull = true)
     |-- cost: int (nullable = true)
I have to check whether each row's items are contained in transactions, counting duplicates, and if so sum up the costs: [1, 3, 3, 5] in [1, 3, 3, 3, 5] is True, [1, 3, 3, 5] in [1, 2] is False, and so on.
The result should be:

    id    transactions          score
    1     [1, 3, 3, 3, 2, 5]    3
    2     [1, 2]                null
I tried explode-and-join approaches (inner, left_semi), but they all fail because of the duplicates. issubset() and array_intersect(), as suggested in Check all the elements of an array present in another array pyspark, also won't work, since they ignore duplicates.
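To make the duplicate problem concrete, here is a plain-Python sketch (outside Spark) of why a set-based subset test cannot see multiplicities, while Counter subtraction can:

```python
from collections import Counter

# Set-based subset collapses duplicates: [1, 3, 3] becomes {1, 3},
# so it wrongly reports that [1, 3, 3] is "contained" in [1, 3].
set_check = set([1, 3, 3]).issubset(set([1, 3]))   # True (wrong for multisets)

# Counter subtraction keeps counts: Counter([1, 3, 3]) - Counter([1, 3])
# leaves Counter({3: 1}), which is truthy, so the containment test
# correctly comes out False.
counter_check = not (Counter([1, 3, 3]) - Counter([1, 3]))  # False (correct)
```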
I came across Python - verifying if one list is a subset of the other, and found that the following solves the problem and is much more efficient:
    >>> from collections import Counter
    >>> not Counter([1, 3, 3, 3, 5]) - Counter([1, 3, 3, 4, 5])
    False
    >>> not Counter([1, 3, 3, 3, 5]) - Counter([1, 3, 3, 5])
    False
    >>> not Counter([1, 3, 3, 5]) - Counter([1, 3, 3, 3, 5])
    True
So I tried the following:
@udf("boolean")
def contains_all(x, y):
if x is not None and y is not None:
return not (lambda y: dict(Counter(y)))-(lambda x: dict(Counter(x)))
(df1
.crossJoin(df2).groupBy("id", "transactions")
.agg(sum_(when(
contains_all("transactions", "items"), col("cost")
)).alias("score"))
.show())
but it throws an error:

    File "", line 39, in contains_all
    TypeError: unsupported operand type(s) for -: 'function' and 'function'
Is there any other way to achieve this?
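For clarity, the TypeError comes from subtracting the two lambda objects instead of the Counters themselves. Here is a pure-Python sketch (the Spark wiring with @udf, crossJoin and groupBy set aside) of the check and the scoring I am trying to express; it reproduces the expected scores on the sample data:

```python
from collections import Counter

def contains_all(transactions, items):
    # True when every element of `items` occurs in `transactions` at least
    # as many times (multiset containment via Counter subtraction).
    if transactions is not None and items is not None:
        return not (Counter(items) - Counter(transactions))
    return False

df1_rows = [(1, [1, 3, 3, 3, 2, 5]), (2, [1, 2])]
df2_rows = [([1, 3, 3, 5], 2), ([1, 5], 1)]

def score(transactions):
    # Sum the costs of all item lists contained in `transactions`;
    # None when nothing matches (mirroring Spark's null sum over no rows).
    costs = [cost for items, cost in df2_rows
             if contains_all(transactions, items)]
    return sum(costs) if costs else None

result = [(i, t, score(t)) for i, t in df1_rows]
# result == [(1, [1, 3, 3, 3, 2, 5], 3), (2, [1, 2], None)]
```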