
I have a pandas dataframe like the following:

df = pd.DataFrame({'Person_ID': [1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                   'Item_ID':   [1, 1, 2, 4, 4, 4, 2, 3, 3, 1, 2, 2, 2],
                   'Value':     [1, 4, 6, 5, 8, 7, 3, 2, 9, 8, 4, 1, 2]})

I would like to group this by both Person_ID and Item_ID, and then take the average of the first value in each group. For example: Person_ID 1 is associated with Item_IDs 1, 2 and 4. The first entries for these (Person_ID, Item_ID) groups have values 1, 6 and 5 respectively, so the average is 4.

    Person_ID   Item_ID Value
0   1   1   1
1   1   1   4
2   1   2   6
3   1   4   5
4   1   4   8
5   1   4   7
6   2   2   3
7   2   3   2
8   2   3   9
9   3   1   8
10  3   2   4
11  3   2   1
12  3   2   2

Desired outcome:

Person_ID Average_value_first_entries
1   4
2   2.5
3   6

I noticed a similar question, Pandas dataframe get first row of each group, but mine is a slightly more complex version of it: instead of grouping by one id, I would like to group by two ids and then take the average.

I tried the following:

df.groupby(['Person_ID', 'Item_ID']).first()['Value']

However, this returns each 'first' entry rather than the averages:

Person_ID  Item_ID
1          1          1
           2          6
           4          5
2          2          3
           3          2
3          1          8
           2          4
Name: Value, dtype: int64

Any help would be appreciated. I would be especially grateful for 'computationally efficient' (time) solutions, since the real dataset contains 1.2 million rows.

bprsch

3 Answers


After you take the first row per (Person_ID, Item_ID) group, you need to group again on Person_ID to find the mean. Like this:

In [1312]: d = df.groupby(['Person_ID','Item_ID'], as_index=False).head(1)

In [1315]: d.groupby('Person_ID', as_index=False)['Value'].mean()
Out[1315]: 
   Person_ID  Value
0          1    4.0
1          2    2.5
2          3    6.0
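One caveat worth noting (my own observation, not part of the original answer): `head(1)` keeps the literal first row of each group, while `first()` returns the first *non-null* value per column, so the two can disagree when `Value` contains NaNs. A minimal sketch:

```python
import numpy as np
import pandas as pd

# A group whose first Value happens to be missing
df = pd.DataFrame({'Person_ID': [1, 1], 'Value': [np.nan, 5.0]})

# head(1) keeps the literal first row, NaN and all
head_val = df.groupby('Person_ID').head(1)['Value'].iloc[0]

# first() skips ahead to the first non-null value in the column
first_val = df.groupby('Person_ID')['Value'].first().iloc[0]

print(head_val, first_val)  # nan 5.0
```

For the data in the question this makes no difference, since there are no missing values.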

Or as a one-liner, as suggested by @Datanovice:

In [1320]: df.groupby(['Person_ID','Item_ID']).first().groupby(level=0)['Value'].mean().to_frame().reset_index()
Out[1320]: 
   Person_ID  Value
0          1    4.0
1          2    2.5
2          3    6.0
Mayank Porwal

We can use drop_duplicates, which by default keeps the first row of each (Person_ID, Item_ID) pair:

df.drop_duplicates(['Person_ID','Item_ID']).groupby(['Person_ID']).Value.mean()
Person_ID
1    4.0
2    2.5
3    6.0
Name: Value, dtype: float64
BENY
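For completeness, here is the drop_duplicates approach above run end to end on the question's data (a sketch; `keep='first'` is the default for `drop_duplicates` and is written out only for clarity):

```python
import pandas as pd

df = pd.DataFrame({'Person_ID': [1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                   'Item_ID':   [1, 1, 2, 4, 4, 4, 2, 3, 3, 1, 2, 2, 2],
                   'Value':     [1, 4, 6, 5, 8, 7, 3, 2, 9, 8, 4, 1, 2]})

# Keep only the first row of each (Person_ID, Item_ID) pair,
# then average those first values per person
out = (df.drop_duplicates(['Person_ID', 'Item_ID'], keep='first')
         .groupby('Person_ID')['Value']
         .mean())

print(out.tolist())  # [4.0, 2.5, 6.0]
```

Because this never materialises the per-group aggregation that `.first()` performs, it tends to be a good fit for the large-dataset case mentioned in the question, though it is worth timing both on the real data.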

We can combine groupby with unstack:

(df.groupby(['Person_ID','Item_ID'])
 .first()
 .unstack()
 .mean(1)
 .reset_index(name='Average_value_first_entries')
  )

    Person_ID   Average_value_first_entries
0       1               4.0
1       2               2.5
2       3               6.0
sammywemmy
  • While it is useful that this code includes a line that renames the output column, I found this approach slightly less robust than the other solutions. My real dataframe has a number of other columns (not relevant to this problem), so the code above needs to be adapted to select the 'Value' column, e.g.: `df[['Person_ID','Item_ID', 'Value']].groupby(['Person_ID','Item_ID']).first().unstack().mean(1).reset_index(name='Average_value_first_entries')` – bprsch May 21 '20 at 09:41