How can I use bitwise OR as an aggregation function in PySpark DataFrame.groupBy? Is there a built-in function, like sum, that can do this for me?
- What version of pyspark? – pault Aug 22 '19 at 14:25
- @pault I am looking at 2.4.0. – Arjun Ahuja Aug 22 '19 at 15:58
- Spark 2.4 allows for higher-order functions like [`aggregate`](https://spark.apache.org/docs/latest/api/sql/index.html#aggregate). You should be able to do this without a `udf` (see the sketch after these comments). – pault Aug 22 '19 at 15:59
- @pault I understand, but is there a performance overhead of using a udf? – Arjun Ahuja Aug 23 '19 at 08:45
- Yes. [Spark functions vs UDF performance?](https://stackoverflow.com/questions/38296609/spark-functions-vs-udf-performance) – pault Aug 23 '19 at 11:27
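
A minimal sketch of the approach pault suggests, assuming Spark 2.4+, an integer column colA, and a hypothetical grouping column grp: collect each group's values, then fold them with the SQL higher-order function `aggregate` and the bitwise OR operator `|`, staying entirely inside JVM expressions.

from pyspark.sql import functions as F

# Fold each group's collected values with bitwise OR, starting from 0
# (the identity element for OR). No Python UDF is involved.
df.groupBy('grp').agg(
    F.expr("aggregate(collect_list(colA), 0, (acc, x) -> acc | x)").alias('colA_or')
).show()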
1 Answer
There is no built-in bitwise OR aggregation function in PySpark.
If your column is Boolean, the OR of a group is simply "at least one value is true", so you can cast to an integer and sum: the OR is true exactly when df.agg(F.sum(F.col('colA').cast('int'))) is positive.
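For example, a minimal sketch of the Boolean case, assuming a hypothetical grouping column grp:

from pyspark.sql import functions as F

# Per-group OR of a Boolean column: cast to int, sum, compare to 0.
df.groupBy('grp').agg(
    (F.sum(F.col('colA').cast('int')) > 0).alias('colA_or')
).show()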
Otherwise, you'll have to make a custom aggregation function.
There are three ways:
1 - The fastest is to implement a custom aggregation function in Scala and call it from PySpark (a sketch of the Python-side wiring follows).
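A minimal sketch of the Python side, assuming you have compiled a Scala/Java UDAF into a JAR on the classpath under the hypothetical class name com.example.BitwiseOrAgg:

# Register the JVM aggregation function under a SQL name, then call it in SQL.
spark.udf.registerJavaUDAF("bitwise_or", "com.example.BitwiseOrAgg")
df.createOrReplaceTempView("tbl")
spark.sql("SELECT grp, bitwise_or(colA) AS colA_or FROM tbl GROUP BY grp").show()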
2 - Use a UDF:
from functools import reduce

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

def bitwiseOr(l):
    # Fold all collected values together with bitwise OR
    return reduce(lambda x, y: x | y, l)

udf_bitwiseOr = F.udf(bitwiseOr, IntegerType())
df.agg(udf_bitwiseOr(F.collect_list('colA'))).show()
3 - Use an RDD:
# seqOp folds one row into a partition-local result;
# combOp merges the partial results from two partitions.
seqOp = lambda local_result, row: local_result | row['colA']
combOp = lambda local_result1, local_result2: local_result1 | local_result2

df.rdd.aggregate(0, seqOp, combOp)
Methods 2 and 3 have similar performance: both move the data out of the JVM into Python, either through a UDF or through the RDD API.

Pierre Gourseaud