Python newbie here, this is my first question. I tried to find a solution on similar SO questions, like this one, this one, and also this one, but I think my problem is different.
Here's my situation: I have a quite large dataset with two columns: Date (datetime object), and session_id (integer). The timestamps refer to the moment where a certain action occurred during an online session.
My problem is that I have all the dates, but I am missing some of the corresponding session_id values. What I would like to do is to fill these missing values using the date column:
- If the action occurred between the first and last date of a certain session, I would like to fill the missing value with the id of that session.
- I would mark as '0' the session where the action occurred outside the range of any session -
- and mark it as '-99' if it is not possible to associate the event to a single session, because it occurred during the time range of different session.
To give an example of my problem, let's consider the toy dataset below, where I have just three sessions: a, b, c. Session a and b registered three events, session c two. Moreover, I have three missing id values.
| DATE |sess_id|
----------------------------------
0 | 2018-01-01 00:19:01 | a |
1 | 2018-01-01 00:19:05 | b |
2 | 2018-01-01 00:21:07 | a |
3 | 2018-01-01 00:22:07 | b |
4 | 2018-01-01 00:25:09 | c |
5 | 2018-01-01 00:25:11 | Nan |
6 | 2018-01-01 00:27:28 | c |
7 | 2018-01-01 00:29:29 | a |
8 | 2018-01-01 00:30:35 | Nan |
9 | 2018-01-01 00:31:16 | b |
10 | 2018-01-01 00:35:22 | Nan |
...
[Image_Timeline example][1]
This is what I would like to obtain:
| DATE |sess_id|
----------------------------------
0 | 2018-01-01 00:19:01 | a |
1 | 2018-01-01 00:19:05 | b |
2 | 2018-01-01 00:21:07 | a |
3 | 2018-01-01 00:22:07 | b |
4 | 2018-01-01 00:25:09 | c |
5 | 2018-01-01 00:25:11 | -99 |
6 | 2018-01-01 00:27:28 | c |
7 | 2018-01-01 00:29:29 | a |
8 | 2018-01-01 00:30:35 | b |
9 | 2018-01-01 00:31:16 | b |
10 | 2018-01-01 00:35:22 | 0 |
...
In this way I will be able to recover at least some of the events without session code. I think that maybe the first thing to do is to compute two new columns showing the first and last time value for each session, something like that:
foo['last'] = foo.groupby('sess_id')['DATE'].transform(max)
foo['firs'] = foo.groupby('SESSIONCODE')['DATE'].transform(min)
And then use first-last time value to check whether each event whose session id is unknown falls withing that range.