
I have a PySpark DataFrame and want to make several sub-DataFrames using a groupBy operation. For example, I have a DF like

       subject  relation object 
DF =      s1       p       o1
          s2       p       o2
          s3       q       o3
          s4       q       o4

and want to get sub-DataFrames that each share the same relation value, like

       subject  relation object 
DF1 =      s1       p       o1
           s2       p       o2
       subject  relation object 
DF2 =      s3       q       o3
           s4       q       o4
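
For reference, a minimal sketch for building this example DataFrame (assuming an existing SparkSession named spark):

DF = spark.createDataFrame(
    [('s1', 'p', 'o1'),
     ('s2', 'p', 'o2'),
     ('s3', 'q', 'o3'),
     ('s4', 'q', 'o4')],
    ['subject', 'relation', 'object'])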

I would appreciate it if you could share your ideas on how to make sub-DataFrames using groupBy().

Thanks

youngtackpark

1 Answer


Note that groupBy() itself returns a GroupedData object meant for aggregation, so you cannot filter or select on it directly. Instead, you can collect the distinct relation values and filter the original DataFrame for each one to build a list of sub-DataFrames:

df_list = []
for row in DF.select('relation').distinct().sort('relation').collect():
    current_relation = row['relation']
    # keep only the rows of the original DataFrame with this relation value
    df_list.append(DF.filter(DF['relation'] == current_relation))
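
As a quick check (a hypothetical usage sketch, assuming the df_list built above), each element is an ordinary DataFrame that can be shown or processed independently; with the sorted relation values, df_list[0] would hold the 'p' rows and df_list[1] the 'q' rows:

for sub_df in df_list:
    # each sub_df contains the rows for exactly one relation value
    sub_df.show()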
SchwarzeHuhn