I have the following problem which as I read, it's mentioned as the 'gaps and islands' problem. A sample of my data are like the below (I excluded some columns related to the user also, but I suppose the approach would be the same):
user | reports_to | date |
---|---|---|
john | mark | 1/1/2021 |
john | mark | 2/1/2021 |
john | mark | 3/1/2021 |
john | mark | 4/1/2021 |
john | mark | 5/1/2021 |
john | mark | 6/1/2021 |
john | mark | 7/1/2021 |
john | mark | 8/1/2021 |
john | mark | 9/1/2021 |
john | mark | 10/1/2021 |
john | mark | 11/1/2021 |
john | nick | 12/1/2021 |
john | nick | 13/1/2021 |
john | nick | 14/1/2021 |
john | nick | 15/1/2021 |
john | nick | 16/1/2021 |
john | nick | 17/1/2021 |
john | nick | 18/1/2021 |
john | nick | 19/1/2021 |
john | nick | 20/1/2021 |
john | mark | 21/1/2021 |
john | mark | 22/1/2021 |
john | mark | 23/1/2021 |
john | mark | 24/1/2021 |
john | mark | 25/1/2021 |
My dataset contains thousands of rows and couple more columns that are related to the characteristics of the user. That pattern may exist in multiple users.
When I try to find the min and max date that a user had that manager using a simple group by, I get the following result:
user | reports_to | from | to |
---|---|---|---|
john | mark | 1/1/2021 | 25/1/2021 |
john | nick | 12/1/2021 | 20/1/2021 |
Logically, that result is wrong. The desired result would be the following:
user | reports_to | from | to |
---|---|---|---|
john | mark | 1/1/2021 | 11/1/2021 |
john | nick | 12/1/2021 | 20/1/2021 |
john | mark | 21/1/2021 | 25/1/2021 |
My project is on python and I try to solve it using window_functions or rank functions like the following
df["rank"] = df.groupby(['user','reports_to'])["date"].rank("dense", ascending=True)
in order to compare the rank vs the previous date, but I still cannot solve that problem. Which approach would be the best and thank you in advance for your time.