    col1 col2
    [1,2,3,4] [0,1,0,3]
    [5,6,7,8] [0,3,4,8]

desired result:

    col1 col2
    [6,8,10,12] [0,4,4,11]

In Snowflake's Snowpark this is relatively straightforward using array_construct. Apache Spark has a similar array function, but there is a major difference.

In Snowpark, I can do array_construct(count('*'), sum(col('x')), sum(col('y')), count(col('y'))), but Apache Spark seems to treat array() as an aggregation and complains that I can't have an aggregation inside of an aggregation:

    pyspark.sql.utils.AnalysisException: It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.;

I'm trying to write a piece of code that can handle both Snowpark and Apache Spark, but this array_construct vs. array difference is proving trickier than anticipated. Next up is to explore a groupBy & collect_list approach, but I'm wondering how others have solved this.
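
For reference, a rough sketch of what I'm attempting on each side (x and y are placeholder column names; the Snowpark call is shown as a comment):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 2), (3, 4)], ["x", "y"])

    # Snowpark: array_construct(count('*'), sum(col('x')), sum(col('y')), count(col('y')))
    # PySpark attempt: wrap the aggregates in array() inside a single agg() call
    df.agg(
        F.array(
            F.count(F.lit(1)),  # count(*)
            F.sum("x"),
            F.sum("y"),
            F.count("y"),
        ).alias("stats")
    ).show()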

whisperstream
  • Does this answer your question? [pyspark - aggregate (sum) vector element-wise](https://stackoverflow.com/questions/54354915/pyspark-aggregate-sum-vector-element-wise) – David Jan 31 '23 at 04:23

2 Answers


If you treat the aggregations as delimited strings, you can split them into an array:

    select a, split(concat_ws(',', count(b), sum(c)), ',')
    from t
    group by a;
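
If you need the same thing from the DataFrame API, a rough PySpark equivalent of the query above (using the same placeholder names t, a, b, c) might look like this:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    t = spark.createDataFrame([("k1", 1, 10), ("k1", 2, 20), ("k2", 3, 30)], ["a", "b", "c"])

    # concat_ws stringifies the aggregates, split turns the string back into an array
    t.groupBy("a").agg(
        F.split(F.concat_ws(",", F.count("b"), F.sum("c")), ",").alias("agg_array")
    ).show(truncate=False)

Note that the resulting array elements are strings, so cast them back if you need numeric values.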
Radagast

Turns out it's not the difference between the array functions at all; it was a completely different issue.

I have a translation function that translates Snowpark's regexp_count function to PySpark using the formula:

    def regexp_count(self, col, regex):
        # sums a 1/0 match flag over rows -- this is an aggregate expression
        return F.sum(F.when(col.rlike(regex), 1).otherwise(0))

This caused issues later on when I did F.sum(regexp_count(col, regex)), hence the error, because, as becomes obvious, there is an F.sum inside an F.sum :/
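
In other words, the translated expression expands to roughly this (shown as comments only, since it is exactly what fails):

    # F.sum(F.sum(F.when(col.rlike(regex), 1).otherwise(0)))
    # i.e. an aggregate inside an aggregate, which is what the AnalysisException complains about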

Instead, the following translation fixes the issue

    def regexp_count(self, col, regex):
        # counts matches per row by splitting on the regex -- no aggregate involved
        return F.size(F.split(col, regex)) - 1
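
With that translation, the outer F.sum wraps a plain per-row expression, so a call like the one that previously failed now runs; a minimal sketch (with a made-up column name and pattern):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("abc abc",), ("abc",)], ["txt"])

    # per-row match count via split -- not an aggregate
    matches = F.size(F.split(F.col("txt"), "abc")) - 1

    # wrapping it in F.sum is now a single (not nested) aggregate
    df.agg(F.sum(matches).alias("total_matches")).show()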
whisperstream