Pyspark groupby and return the whole row

Question

I have the following dataframe named df:

name    score    data1    data2
Name1   23       'a'      'b'
Name2   2        'c'      'd'
Name1   100      'b'      'f'
Name3   5        'a'      'f'
Name2   6        'e'      'g'
Name3   500      'f'      'n'

The desired result:

Find max score for each name and return the associated data (data1 and data 2) as a dict:

grouped_df = df.groupby('name').agg({'score':'max'})

This produces:

name   score 
Name1  100
Name2  6
Name3  500

But what I would ideally like to get is

name   score  data1  data2
Name1  100    'b'    'f'
Name2  6      'e'    'g'
Name3  500    'f'    'n'

Or:

dict = {'b':1, 'e':1,'f':2,'g':1}

I attempted joining df and grouped_df by 2 columns (name and score), but that doesn't keep only the max, it returns all the columns. Is there any acceptable solution for this using dataframes?

score 3 · Accepted Answer · answered Apr 25 '17 at 13:24

3

You'll need to join the two dataframes

joinedData = grouped_df.join(df, ['name', 'score'])

answered Apr 25 '17 at 13:24

David

11,245
3
41
46

How do I make it choose only one row if it returns multiple rows with the same name and same score? I added .dropDuplicates() to drop duplicate rows, but I don't know how do I force it to choose one row when 2 or more rows return same name and score but different data fields? And how do I collect multiple columns and transform to dict? – user201411 Apr 25 '17 at 13:44
Yup. There will be potential duplicates if two names have the same score. `dropDuplicates('name')` is what you're looking for, I think. It will sure that that no two rows have the same name. – David Apr 25 '17 at 13:48

Pyspark groupby and return the whole row

1 Answers1

Linked