
I'm running a groupBy operation on a DataFrame in PySpark, and I need to group by a list that may contain one or two features. How can I do this?

record_fields = [
    ['record_edu_desc'],
    ['record_construction_desc'],
    ['record_cost_grp'],
    ['record_bsmnt_typ_grp_desc'],
    ['record_shape_desc'],
    ['record_sqft_dec_grp', 'record_renter_grp_c_flag'],
    ['record_home_age'],
    ['record_home_age_grp', 'record_home_age_missing'],
]


for field in record_fields:
    df_group = df.groupBy('year', 'area', 'state', 'code', field).sum('net_contributions')
    ### df write to csv operation

My first thought was to create a list of lists and pass each one to the groupBy operation, but I get the following error:

TypeError: Invalid argument, not a string or column: ['record_edu_desc'] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

How do I make this work? I'm open to other ways I could do this.

LaSul

1 Answer


Try this (note the asterisk * before field):

for field in record_fields:
    df_group = df.groupBy('year', 'area', 'state', 'code', *field).sum('net_contributions')

Also take a look at this question to learn more about the asterisk (iterable unpacking) operator in Python.
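To see the mechanics without a Spark cluster, here is a minimal sketch in plain Python (group_by below is a hypothetical stand-in for df.groupBy, used only to show what arguments arrive): *field unpacks the list's elements into separate positional arguments, which is what groupBy expects.

```python
def group_by(*cols):
    # Stand-in for df.groupBy: simply returns the column names it receives.
    return cols

field = ['record_sqft_dec_grp', 'record_renter_grp_c_flag']

# Without the asterisk, the whole list arrives as a single argument:
without_star = group_by('year', 'area', field)
# ('year', 'area', ['record_sqft_dec_grp', 'record_renter_grp_c_flag'])

# With the asterisk, each element becomes its own argument:
with_star = group_by('year', 'area', *field)
# ('year', 'area', 'record_sqft_dec_grp', 'record_renter_grp_c_flag')

print(without_star)
print(with_star)
```

The first call is what triggered the TypeError above, since groupBy received a list where it expected strings or columns.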

Ala Tarighati