
Apologies if my question has been answered before, or the answer is obvious.

Let's say that in my dataset there are two tasks, with 20 different trials each. Now I would like to select only the last 6 seconds of each trial for further analysis.

The dataset looks roughly like this (plus more columns). This sample covers all 20 trials of one task; the index values are as in the full dataset, and time is given as Unix timestamps (ms).

index   time                x           y          Trial_Id
13512   1519227368636.0000  1022.0000   602.0000    1
13513   1519227368683.0000  1019.0000   697.0000    1
13514   1519227368728.0000  966.0000    530.0000    1
13515   1519227368752.0000  961.0000    576.0000    1
13516   1519227368806.0000  1120.0000   631.0000    1
...
17076   1519227518503.0000  804.0000    694.0000    20
17077   1519227518549.0000  789.0000    738.0000    20
17078   1519227518596.0000  809.0000    747.0000    20
17079   1519227518678.0000  806.0000    735.0000    20
17080   1519227518713.0000  823.0000    605.0000    20

At the level of a single trial, iloc does the job. However, when I try to apply iloc to the data grouped by Trial_Id, I get the error:

TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed.

The code I use:

A function to preserve the last 6 seconds:

def img_trial(data, start):
    data1 = data.iloc[start:-1, :]
    return data1

The function applied to the data grouped by trial:

data.groupby(['Trial_Nr']).apply(img_trial(data, 80))

Can you please give me a hint on what's wrong? I'm quite a pandas newbie. Sorry if my question is not clear enough (this is the first post of a long-time lurker).

Best regards,

Nat

user396156
    Welcome to StackOverflow. Please take the time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. – jezrael Feb 26 '18 at 11:08
  • If your time is to second resolution then you could do something like: `df.sort_values('time_column').groupby('Trial_Nr').tail(6)` - but it's hard to say until you include the necessary information in the post as per @jezrael's comment... – Jon Clements Feb 26 '18 at 11:32
  • thanks for the hints! I hope that the last edit clarifies my question a bit – user396156 Feb 26 '18 at 12:51

1 Answer


I believe you need boolean indexing with transform, which broadcasts the maximum datetime of each group back to every row:

print (df)
    index                time       x      y  Trial_Id
8   13515  1519227361052.0000   961.0  576.0         1
7   13514  1519227362028.0000   966.0  530.0         1
5   13512  1519227363636.0000  1022.0  602.0         1
4   13516  1519227364806.0000  1120.0  631.0         1
3   13515  1519227365752.0000   961.0  576.0         1
2   13514  1519227366728.0000   966.0  530.0         1
1   13513  1519227367683.0000  1019.0  697.0         1
9   13516  1519227368906.0000  1120.0  631.0         1
6   13513  1519227369283.0000  1019.0  697.0         1
0   13512  1519227369836.0000  1022.0  602.0         1
10  17076  1519227518503.0000   804.0  694.0        20
11  17077  1519227518549.0000   789.0  738.0        20
12  17078  1519227518596.0000   809.0  747.0        20
13  17079  1519227518678.0000   806.0  735.0        20
14  17080  1519227518713.0000   823.0  605.0        20

#convert column time to datetime
df['time'] = pd.to_datetime(df['time'].astype(float), unit='ms')

#get max date per group
max_per_group = df.groupby('Trial_Id')['time'].transform('max') 
#subtract 6 seconds
diff_6_sec = max_per_group - pd.Timedelta(6, unit='s')
#filter
df = df[diff_6_sec < df['time']]
print (df)
    index                    time       x      y  Trial_Id
4   13516 2018-02-21 15:36:04.806  1120.0  631.0         1
3   13515 2018-02-21 15:36:05.752   961.0  576.0         1
2   13514 2018-02-21 15:36:06.728   966.0  530.0         1
1   13513 2018-02-21 15:36:07.683  1019.0  697.0         1
9   13516 2018-02-21 15:36:08.906  1120.0  631.0         1
6   13513 2018-02-21 15:36:09.283  1019.0  697.0         1
0   13512 2018-02-21 15:36:09.836  1022.0  602.0         1
10  17076 2018-02-21 15:38:38.503   804.0  694.0        20
11  17077 2018-02-21 15:38:38.549   789.0  738.0        20
12  17078 2018-02-21 15:38:38.596   809.0  747.0        20
13  17079 2018-02-21 15:38:38.678   806.0  735.0        20
14  17080 2018-02-21 15:38:38.713   823.0  605.0        20
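The steps above can be combined into a self-contained sketch on a tiny synthetic frame (the column names `time` and `Trial_Id` are taken from the question's sample; the millisecond offsets are made up for illustration):

```python
import pandas as pd

# Tiny synthetic frame standing in for the question's data:
# two trials, five samples each, with times as millisecond offsets.
df = pd.DataFrame({
    'time': [0, 2000, 5000, 8000, 10000,      # trial 1
             0, 3000, 7000, 9000, 12000],     # trial 2
    'Trial_Id': [1] * 5 + [2] * 5,
})
df['time'] = pd.to_datetime(df['time'], unit='ms')

# Max timestamp per trial, broadcast back to every row via transform
max_per_group = df.groupby('Trial_Id')['time'].transform('max')

# Keep only rows that fall within the last 6 seconds of each trial
out = df[df['time'] > max_per_group - pd.Timedelta(6, unit='s')]
print(out)
```

With these offsets, each trial keeps its last three samples (those within 6 seconds of that trial's maximum timestamp).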

For better understanding, it is possible to concat everything together to check:

print (pd.concat([df['time'], max_per_group, diff_6_sec], 
                 axis=1, 
                 keys=('orig', 'max', 'sub_6s')))

                      orig                     max                  sub_6s
8  2018-02-21 15:36:01.052 2018-02-21 15:36:09.836 2018-02-21 15:36:03.836
7  2018-02-21 15:36:02.028 2018-02-21 15:36:09.836 2018-02-21 15:36:03.836
5  2018-02-21 15:36:03.636 2018-02-21 15:36:09.836 2018-02-21 15:36:03.836
4  2018-02-21 15:36:04.806 2018-02-21 15:36:09.836 2018-02-21 15:36:03.836
3  2018-02-21 15:36:05.752 2018-02-21 15:36:09.836 2018-02-21 15:36:03.836
2  2018-02-21 15:36:06.728 2018-02-21 15:36:09.836 2018-02-21 15:36:03.836
1  2018-02-21 15:36:07.683 2018-02-21 15:36:09.836 2018-02-21 15:36:03.836
9  2018-02-21 15:36:08.906 2018-02-21 15:36:09.836 2018-02-21 15:36:03.836
6  2018-02-21 15:36:09.283 2018-02-21 15:36:09.836 2018-02-21 15:36:03.836
0  2018-02-21 15:36:09.836 2018-02-21 15:36:09.836 2018-02-21 15:36:03.836
10 2018-02-21 15:38:38.503 2018-02-21 15:38:38.713 2018-02-21 15:38:32.713
11 2018-02-21 15:38:38.549 2018-02-21 15:38:38.713 2018-02-21 15:38:32.713
12 2018-02-21 15:38:38.596 2018-02-21 15:38:38.713 2018-02-21 15:38:32.713
13 2018-02-21 15:38:38.678 2018-02-21 15:38:38.713 2018-02-21 15:38:32.713
14 2018-02-21 15:38:38.713 2018-02-21 15:38:38.713 2018-02-21 15:38:32.713
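As an aside on the TypeError in the question: `GroupBy.apply` expects a callable, but `img_trial(data, 80)` calls the function immediately and hands its result (a DataFrame) to apply, which pandas then tries to treat as a function. A minimal sketch of the corrected call, assuming per-group slicing was the intent (the frame and `start` value here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'x': range(10),
    'Trial_Id': [1] * 5 + [2] * 5,
})

def img_trial(data, start):
    # slice each group from `start` up to (but not including) the last row
    return data.iloc[start:-1, :]

# Pass the function itself; extra arguments are forwarded by apply.
out = df.groupby('Trial_Id', group_keys=False).apply(img_trial, start=2)
print(out)
```

With five rows per group and `start=2`, `iloc[2:-1, :]` keeps two rows of each group (the third and fourth), so `out` has four rows in total.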
jezrael
  • Can you however please elaborate on 'lambda x: x.max() - pd.Timedelta(6, unit='s')'? you shift the max value by 6 seconds down? And then: df = df[s < df['time']] this removes all datetimes that are larger than shifted datetimes in the series? I triple checked and it does the job but I am still confused as to how. Sorry if this is obvious – user396156 Feb 26 '18 at 14:31
  • @user396156 - sure, also I have idea for better answer. – jezrael Feb 26 '18 at 14:39