
I'm running a groupBy operation on a DataFrame in PySpark, and I need to group by a list that may contain one or two features. How can I do this?

record_fields = [
    ['record_edu_desc'],
    ['record_construction_desc'],
    ['record_cost_grp'],
    ['record_bsmnt_typ_grp_desc'],
    ['record_shape_desc'],
    ['record_sqft_dec_grp', 'record_renter_grp_c_flag'],
    ['record_home_age'],
    ['record_home_age_grp', 'record_home_age_missing'],
]


for field in record_fields:
    df_group = df.groupBy('year', 'area', 'state', 'code', field).sum('net_contributions')
    ### df write to csv operation

My first thought was to create a list of lists and pass each one to the groupBy operation, but I get the following error:

TypeError: Invalid argument, not a string or column: ['record_edu_desc'] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

How do I make this work? I'm open to other ways I could do this.

LaSul

1 Answer


Try this (note the asterisk * before field):

for field in record_fields:
    df_group = df.groupBy('year', 'area', 'state', 'code', *field).sum('net_contributions')

Also take a look at this question to learn more about the asterisk (iterable unpacking) operator in Python.
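To see the mechanics without a Spark cluster, here is a minimal sketch in plain Python (group_by below is a hypothetical stand-in for df.groupBy, used only to show what arguments arrive): *field unpacks the list's elements into separate positional arguments, which is what groupBy expects.

```python
def group_by(*cols):
    # Stand-in for df.groupBy: simply returns the column names it receives.
    return cols

field = ['record_sqft_dec_grp', 'record_renter_grp_c_flag']

# Without the asterisk, the whole list arrives as a single argument:
without_star = group_by('year', 'area', field)
# ('year', 'area', ['record_sqft_dec_grp', 'record_renter_grp_c_flag'])

# With the asterisk, each element becomes its own argument:
with_star = group_by('year', 'area', *field)
# ('year', 'area', 'record_sqft_dec_grp', 'record_renter_grp_c_flag')

print(without_star)
print(with_star)
```

The first call is what triggered the TypeError above, since groupBy received a list where it expected strings or columns.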

Ala Tarighati