    col1 col2
    [1,2,3,4] [0,1,0,3]
    [5,6,7,8] [0,3,4,8]

desired result:

    col1 col2
    [6,8,10,12] [0,4,4,11]

In Snowflake's Snowpark this is relatively straightforward using array_construct. Apache Spark has a similar array function, but there is a major difference.

In Snowpark, I can do array_construct(count('*'), sum(col('x')), sum(col('y')), count(col('y'))), but Apache Spark seems to treat array() as an aggregation and complains that I can't have an aggregation inside of an aggregation:

    pyspark.sql.utils.AnalysisException: It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.;

I'm trying to write a piece of code that can handle both Snowpark and Apache Spark, but this array_construct vs. array difference is proving trickier than anticipated. Next up is to explore a groupBy & collect_list approach, but I'm wondering how others have solved this.
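
For reference, a rough sketch of what I'm attempting on each side (x and y are placeholder column names; the Snowpark call is shown as a comment):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 2), (3, 4)], ["x", "y"])

    # Snowpark: array_construct(count('*'), sum(col('x')), sum(col('y')), count(col('y')))
    # PySpark attempt: wrap the aggregates in array() inside a single agg() call
    df.agg(
        F.array(
            F.count(F.lit(1)),  # count(*)
            F.sum("x"),
            F.sum("y"),
            F.count("y"),
        ).alias("stats")
    ).show()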

whisperstream
  • Does this answer your question? [pyspark - aggregate (sum) vector element-wise](https://stackoverflow.com/questions/54354915/pyspark-aggregate-sum-vector-element-wise) – David Jan 31 '23 at 04:23

2 Answers


If you treat the aggregations as delimited strings, you can split them into an array:

    select a, split(concat_ws(',', count(b), sum(c)), ',')
    from t
    group by a;
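
If you need the same thing from the DataFrame API, a rough PySpark equivalent of the query above (using the same placeholder names t, a, b, c) might look like this:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    t = spark.createDataFrame([("k1", 1, 10), ("k1", 2, 20), ("k2", 3, 30)], ["a", "b", "c"])

    # concat_ws stringifies the aggregates, split turns the string back into an array
    t.groupBy("a").agg(
        F.split(F.concat_ws(",", F.count("b"), F.sum("c")), ",").alias("agg_array")
    ).show(truncate=False)

Note that the resulting array elements are strings, so cast them back if you need numeric values.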
Radagast

Turns out it's not the difference between the array functions at all; it was a completely different issue.

I have a translation function that translates Snowpark's regexp_count function to PySpark using the formula:

    def regexp_count(self, col, regex):
        # sums a 1/0 match flag over rows -- this is an aggregate expression
        return F.sum(F.when(col.rlike(regex), 1).otherwise(0))

This caused issues later on when I did F.sum(regexp_count(col, regex)), hence the error, because, as becomes obvious, there is an F.sum inside an F.sum :/
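
In other words, the translated expression expands to roughly this (shown as comments only, since it is exactly what fails):

    # F.sum(F.sum(F.when(col.rlike(regex), 1).otherwise(0)))
    # i.e. an aggregate inside an aggregate, which is what the AnalysisException complains about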

Instead, the following translation fixes the issue

    def regexp_count(self, col, regex):
        # counts matches per row by splitting on the regex -- no aggregate involved
        return F.size(F.split(col, regex)) - 1
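
With that translation, the outer F.sum wraps a plain per-row expression, so a call like the one that previously failed now runs; a minimal sketch (with a made-up column name and pattern):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("abc abc",), ("abc",)], ["txt"])

    # per-row match count via split -- not an aggregate
    matches = F.size(F.split(F.col("txt"), "abc")) - 1

    # wrapping it in F.sum is now a single (not nested) aggregate
    df.agg(F.sum(matches).alias("total_matches")).show()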
whisperstream