1

I have a timeseries dataset containing scores on scales of depression, anxiety, and trauma for patients. Data was collected at 6 time points for each patient.

mh_data.head(10)
ID    BDI   GAD   TSQ  age
1     57     9     4    22
1     36     9     4    22
1     37     9     4    22
1     38     7     3    22
1     41     8     3    22
1     39     7     3    22
2     29     14    7    35
2     27     12    6    35
2     27     11    6    35
2     23     11    3    35

I want to create a new dataset where each patient has only 1 corresponding value for each of the variables which represents the difference between the first last and recorded data point. So, it will look like this:

ID    BDI   GAD   TSQ  age
1     18     2     1    22
2     1      0     2    35
.     .      .     .    .
.     .      .     .    .
.     .      .     .    .

I've grouped the data and aggregated by first and last scores:

mhs_agg = mhs_data.groupby("ID").agg(['first','last']) 

How can I proceed or is there a more efficient way of doing this? I also have age which is a variable I don't want to be computing the difference for (as this will come out as 0 for everyone).

I've seen all of the following posts and none of the suggestions seem to work for my specific case.

How to apply "first" and "last" functions to columns while using group by in pandas?

Python/Pandas - Aggregating dataframe with first/last function without grouping

Pandas DataFrame groupby two columns and get first and last

ummendial
  • 83
  • 9

3 Answers3

3

You can pass arbitrary functions into .agg(). When you do this, your function receives each pandas series 1 at a time, already filtered down to an individual "group".

I'd probably do the whole calculation as one function instead of multiple separate steps.

import pandas as pd

my_data = pd.DataFrame(
    {'id': [1, 1, 1, 2, 2, 2], 
     'b': [0, 2, 5, 2, 4, 8],
     'c': [5, 8, 1, 4, 2, 1]})

def diff_calc(x):
    x1 = x[:1].values # Get first value
    x2 = x[-1:].values # Get last value
    
    return abs(x1 - x2)

my_data.groupby('id').agg(diff_calc)

Input

id b c
1 0 5
1 2 8
1 5 1
2 2 4
2 4 2
2 8 1

Output

id b c
1 5 4
2 6 3

ps: You may want to use .reset_index() at the end as pandas groupBy is a bit weird, and that may have been the original issue you were running into.

  • Hey Tyler, thanks for the response. But, in this case then I would just be calculating the difference in the first and last value for each column not aggregated by an ID or some other unique identifier, right? – ummendial Jun 05 '21 at 23:55
  • In my example above, the diff_calc gets called 4 times with these inputs: [0, 2, 5], [2, 4, 8], [5, 8, 1], [4, 2, 1]. Let me know if I missed something in the requirements ... but I think that would be aggregating for each value of "id"? – Tyler Rosacker Jun 06 '21 at 00:08
  • Ah yeah sorry I must've missed it myself. Thanks! – ummendial Jun 06 '21 at 10:20
2

try:

df1 = df[['ID','BDI', 'GAD', 'TSQ']].groupby('ID').agg('first')-df[['ID','BDI', 'GAD', 'TSQ']].groupby('ID').agg('last')
df_final = df1.merge(df[['ID','age']].groupby('ID').agg('first'), on='ID')


    BDI  GAD  TSQ  age
ID
1    18    2    1   22
2     6    3    4   35

Second option using lambda to get the first part, then merge

df[['ID','BDI', 'GAD', 'TSQ']].groupby('ID', as_index=False).apply(lambda x: x.groupby('ID').agg('first')-x.groupby('ID').agg('last'))
Jonathan Leon
  • 5,440
  • 2
  • 6
  • 14
  • Hey Jonathan, thanks for the response. This works just fine, could I incorporate the absolute function to get the absolute difference in this case though? – ummendial Jun 05 '21 at 23:54
  • 1
    I've not used the absolute on difference in dataframes, but assume it could be done. The second option I just added to the answer using lambda might be easier (use absolute around the two agg functions) – Jonathan Leon Jun 05 '21 at 23:57
  • Yep that works just fine thanks. One last question, the "id" columns is no longer being counted as column. So, when I run df.shape I get 1327 rows, which corresponds to the number of distinct IDs/patients; however, I only get 4 columns. I tried running .reset_index().set_index("id), but this didn't work. Any suggestions? – ummendial Jun 06 '21 at 02:07
  • 1
    should just be .reset_index(), your code is resetting the index then setting back to 'id' – Jonathan Leon Jun 06 '21 at 02:14
  • I tried running df_final.reset_index(), am I supposed to be using it elsewhere? – ummendial Jun 06 '21 at 02:20
  • 1
    the nuances of pandas....you can tack reset_index() at the end of the second when df_final is created OR df_final.reset_index(inplace=True) OR df_final = df_final.reset_index() – Jonathan Leon Jun 06 '21 at 02:24
2

This should work:

df.groupby('ID').agg(lambda x: x.iloc[0] - x.iloc[-1])
rhug123
  • 7,893
  • 1
  • 9
  • 24