
How can I use bitwise OR as an aggregation function in PySpark DataFrame.groupBy? Is there a built-in function like sum that can do this for me?

Arjun Ahuja

1 Answer


There is no built-in bitwise OR aggregation function in PySpark.

If your column is Boolean, you can simply use df.agg(F.sum('colA')): the sum is nonzero exactly when at least one value is true, which is what an OR aggregation would tell you.
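The sum-as-OR equivalence for Booleans can be checked in plain Python, without Spark:

```python
# For Boolean values, "sum > 0" is equivalent to a logical OR over the column.
values = [False, True, False]
assert (sum(values) > 0) == any(values)   # True == True

all_false = [False, False]
assert (sum(all_false) > 0) == any(all_false)  # False == False
```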

Otherwise, you'll have to make a custom aggregation function.

There are three ways:

1 - The fastest is to implement a custom aggregation function in Scala and call it from PySpark.

2 - Use a UDF:

from functools import reduce  # needed in Python 3, where reduce is no longer a builtin

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

def bitwiseOr(l):
    # Fold the collected list with the bitwise OR operator
    return reduce(lambda x, y: x | y, l)

udf_bitwiseOr = F.udf(bitwiseOr, IntegerType())
df.agg(udf_bitwiseOr(F.collect_list('colA'))).show()
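The core reduce logic the UDF relies on can be verified without a Spark session; a minimal sketch (the function name is illustrative):

```python
from functools import reduce

def bitwise_or(values):
    # Fold the list with |, e.g. 1 | 2 | 4 == 0b111 == 7
    return reduce(lambda x, y: x | y, values)

print(bitwise_or([1, 2, 4]))  # 7
print(bitwise_or([5, 3]))     # 0b101 | 0b011 == 0b111 == 7
```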

3 - Use an RDD:

# seqOp folds rows into a per-partition result; combOp merges partition results
seqOp = (lambda local_result, row: local_result | row['colA'])
combOp = (lambda local_result1, local_result2: local_result1 | local_result2)
rdd = df.rdd
rdd.aggregate(0, seqOp, combOp)
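aggregate applies seqOp within each partition, starting from the zero value, then merges the per-partition results with combOp. The same two-phase fold can be sketched in plain Python (the partition layout below is a hypothetical stand-in for df.rdd):

```python
from functools import reduce

# Hypothetical rows, mimicking an RDD split across two partitions.
partitions = [[{'colA': 1}, {'colA': 2}], [{'colA': 4}]]

seqOp = lambda acc, row: acc | row['colA']
combOp = lambda acc1, acc2: acc1 | acc2

# Phase 1: fold each partition with seqOp from the zero value 0.
partials = [reduce(seqOp, part, 0) for part in partitions]
# Phase 2: merge the per-partition results with combOp.
result = reduce(combOp, partials, 0)
print(result)  # 1 | 2 | 4 == 7
```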

Methods 2 and 3 have similar performance.

Pierre Gourseaud