How can I use bitwise OR as an aggregation function in PySpark DataFrame.groupBy? Is there a built-in function, like sum, that can do this for me?
- What version of pyspark? – pault Aug 22 '19 at 14:25
- @pault I am looking at 2.4.0. – Arjun Ahuja Aug 22 '19 at 15:58
- Spark 2.4 allows for higher-order functions like [`aggregate`](https://spark.apache.org/docs/latest/api/sql/index.html#aggregate). You should be able to do this without a `udf` (see the sketch after these comments). – pault Aug 22 '19 at 15:59
- @pault I understand, but is there a performance overhead of using a udf? – Arjun Ahuja Aug 23 '19 at 08:45
- Yes. [Spark functions vs UDF performance?](https://stackoverflow.com/questions/38296609/spark-functions-vs-udf-performance) – pault Aug 23 '19 at 11:27
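
A minimal sketch of the approach pault suggests, assuming Spark 2.4+, an integer column colA, and a hypothetical grouping column grp: collect each group's values, then fold them with the SQL higher-order function `aggregate` and the bitwise OR operator `|`, staying entirely inside JVM expressions.

from pyspark.sql import functions as F

# Fold each group's collected values with bitwise OR, starting from 0
# (the identity element for OR). No Python UDF is involved.
df.groupBy('grp').agg(
    F.expr("aggregate(collect_list(colA), 0, (acc, x) -> acc | x)").alias('colA_or')
).show()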
1 Answer
There is no built-in bitwise OR aggregation function in PySpark.
If your column is Boolean, the OR of a group is simply "at least one value is true", so you can cast to an integer and sum: the OR is true exactly when df.agg(F.sum(F.col('colA').cast('int'))) is positive.
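For example, a minimal sketch of the Boolean case, assuming a hypothetical grouping column grp:

from pyspark.sql import functions as F

# Per-group OR of a Boolean column: cast to int, sum, compare to 0.
df.groupBy('grp').agg(
    (F.sum(F.col('colA').cast('int')) > 0).alias('colA_or')
).show()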
Otherwise, you'll have to make a custom aggregation function.
There are three ways:
1 - The fastest is to implement a custom aggregation function in Scala and call it from PySpark (a sketch of the Python-side wiring follows).
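A minimal sketch of the Python side, assuming you have compiled a Scala/Java UDAF into a JAR on the classpath under the hypothetical class name com.example.BitwiseOrAgg:

# Register the JVM aggregation function under a SQL name, then call it in SQL.
spark.udf.registerJavaUDAF("bitwise_or", "com.example.BitwiseOrAgg")
df.createOrReplaceTempView("tbl")
spark.sql("SELECT grp, bitwise_or(colA) AS colA_or FROM tbl GROUP BY grp").show()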
2 - Use a UDF:
from functools import reduce

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

def bitwiseOr(l):
    # Fold all collected values together with bitwise OR
    return reduce(lambda x, y: x | y, l)

udf_bitwiseOr = F.udf(bitwiseOr, IntegerType())
df.agg(udf_bitwiseOr(F.collect_list('colA'))).show()
3 - Use an RDD:
# seqOp folds one row into a partition-local result;
# combOp merges the partial results from two partitions.
seqOp = lambda local_result, row: local_result | row['colA']
combOp = lambda local_result1, local_result2: local_result1 | local_result2

df.rdd.aggregate(0, seqOp, combOp)
Methods 2 and 3 have similar performance: both move the data out of the JVM into Python, either through a UDF or through the RDD API.

Pierre Gourseaud