How to modify a Data to remove ids within certain bounds of another time veriable

Question

I have a dataframe looks like :

id     TakingTime
1       03-01-2015
1       18-07-2015
1       22-10-2015
1       14-01-2016

2       11-02-2015
2       28-02-2015
2       18-04-2015
2       19-05-2015

3       11-02-2015
3       16-11-2015
3       19-02-2016
3       21-04-2016

4       03-01-2015
4       03-01-2015
4       03-01-2015
4       03-01-2015

The output desired is :

id     TakingTime
1       03-01-2015
1       18-07-2015
1       22-10-2015
1       14-01-2016

3       11-02-2015
3       16-11-2015
3       19-02-2016
3       21-04-2016

When I want to remove all id which have a difference time between the first and last taking time one year minimum.

I tried with

df[df.groupby('ID')['takingtime'].transform(lambda x: x.nunique() > 1)]

But I'm not sure if it's the right way to do this and if yes what is meaning of > 5 ? Days, Months, Years ... ?

`nunique` returns a `Series` containing *counts* of unique occurrences. I think what you want is this: https://stackoverflow.com/questions/54470917, where column `A` is `ID` and `B` is `TakingTime`. If not, please describe intended output. — costaparas, Dec 13 '20 at 03:32
Also, in future, tag questions like this with [dataframe](https://stackoverflow.com/questions/tagged/dataframe) and [pandas](https://stackoverflow.com/questions/tagged/pandas) to make it more accessible. — costaparas, Dec 13 '20 at 03:33
When using ```df.loc[df.groupby('A').B.idxmin()]```, how can I define a desired minimum time ? — David, Dec 13 '20 at 06:36
I'm not sure what you mean by a minimum time. And, this sentence "And I'd like to remove all id which has a difference time between the first and last taking time under X time (eg: 5 months, 2 years etc..)" is confusing. Could you please [update](https://stackoverflow.com/posts/65270999/edit) your question with the intended output . — costaparas, Dec 13 '20 at 06:41

Cainã Max Couto-Silva · Accepted Answer · 2020-12-13T08:11:50.833

0

Use:

idx = df.groupby('id').TakingTime.transform(lambda x: x.dt.year.diff().sum().astype(bool))
df[idx]

Output:

    id TakingTime
0    1 2015-03-01
1    1 2015-07-18
2    1 2015-10-22
3    1 2016-01-14
8    3 2015-11-02
9    3 2015-11-16
10   3 2016-02-19
11   3 2016-04-21

Explanation:

For each id, take the difference across the years. If there's any difference greater than 0 (i.e. sum().astype(bool)), returns True. We used transform to replicate the output for the whole group. Finally, slice the dataframe with the output indexes.

Edit:

To analyze a specific amount of time (in days):

days = 865
df.groupby('id').TakingTime.transform(lambda x: (x.max() - x.min()).days >= days)

or:

from datetime import timedelta
days = timedelta(865)
df.groupby('id').TakingTime.transform(lambda x: (x.max() - x.min()) >= days)

edited Dec 13 '20 at 08:11

answered Dec 13 '20 at 07:28

Cainã Max Couto-Silva

4,839
1
11
35

Thank you for that vital explanation. Just, one year here was an exemple so if I want to use another time-threshold like 2 years, 45 days or 3 months, it's possible to implement that ? – David Dec 13 '20 at 07:50
Yeah! In this case, first convert your time threshold to days, as datetime difference returns the difference in days. So, for example, 2 years, 3 months and 45 days = (365 * 2) + (30 * 3) + 45 = 865. Then use it in the code I provide above (see udpate). – Cainã Max Couto-Silva Dec 13 '20 at 08:10

How to modify a Data to remove ids within certain bounds of another time veriable

1 Answers1