I'm running a groupBy operation on a DataFrame in PySpark, and I need to group by a list of columns that may contain one or two features. How can I do this?
record_fields = [
    ['record_edu_desc'], ['record_construction_desc'], ['record_cost_grp'],
    ['record_bsmnt_typ_grp_desc'], ['record_shape_desc'],
    ['record_sqft_dec_grp', 'record_renter_grp_c_flag'], ['record_home_age'],
    ['record_home_age_grp', 'record_home_age_missing'],
]

for field in record_fields:
    df_group = df.groupBy('year', 'area', 'state', 'code', field).sum('net_contributions')
    # ... write df_group to CSV
My first thought was to create a list of lists and pass each inner list to groupBy, but I get the following error:
TypeError: Invalid argument, not a string or column: ['record_edu_desc'] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
How do I make this work? I'm open to other ways I could do this.
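For example, is something like this the right direction? A rough sketch using Python's * operator to unpack each inner list, so that groupBy receives separate column-name arguments instead of a single list (df and record_fields are as defined above; the CSV path is just a placeholder):

for field in record_fields:
    # *field unpacks, e.g., ['record_home_age_grp', 'record_home_age_missing']
    # into two separate string arguments to groupBy
    df_group = (
        df.groupBy('year', 'area', 'state', 'code', *field)
          .sum('net_contributions')
    )
    # placeholder write: one CSV per grouping combination
    df_group.write.csv('output_' + '_'.join(field), header=True)

I'm not sure whether unpacking like this, or building a single combined list such as df.groupBy(['year', 'area', 'state', 'code'] + field), is the more idiomatic approach.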