1

Python newbie here, this is my first question. I tried to find a solution on similar SO questions, like this one, this one, and also this one, but I think my problem is different.

Here's my situation: I have a quite large dataset with two columns: Date (datetime object), and session_id (integer). The timestamps refer to the moment where a certain action occurred during an online session.

My problem is that I have all the dates, but I am missing some of the corresponding session_id values. What I would like to do is to fill these missing values using the date column:

  1. If the action occurred between the first and last date of a certain session, I would like to fill the missing value with the id of that session.
  2. I would mark as '0' the session where the action occurred outside the range of any session -
  3. and mark it as '-99' if it is not possible to associate the event to a single session, because it occurred during the time range of different session.

To give an example of my problem, let's consider the toy dataset below, where I have just three sessions: a, b, c. Session a and b registered three events, session c two. Moreover, I have three missing id values.

   |       DATE          |sess_id|
----------------------------------
 0 | 2018-01-01 00:19:01 | a    | 
 1 | 2018-01-01 00:19:05 | b    | 
 2 | 2018-01-01 00:21:07 | a    |
 3 | 2018-01-01 00:22:07 | b    | 
 4 | 2018-01-01 00:25:09 | c    |         
 5 | 2018-01-01 00:25:11 | Nan  |
 6 | 2018-01-01 00:27:28 | c    | 
 7 | 2018-01-01 00:29:29 | a    | 
 8 | 2018-01-01 00:30:35 | Nan  | 
 9 | 2018-01-01 00:31:16 | b    | 
10 | 2018-01-01 00:35:22 | Nan  | 
...

[Image_Timeline example][1]

This is what I would like to obtain:

   |       DATE          |sess_id|
----------------------------------
 0 | 2018-01-01 00:19:01 | a    | 
 1 | 2018-01-01 00:19:05 | b    | 
 2 | 2018-01-01 00:21:07 | a    |
 3 | 2018-01-01 00:22:07 | b    | 
 4 | 2018-01-01 00:25:09 | c    |         
 5 | 2018-01-01 00:25:11 | -99  |
 6 | 2018-01-01 00:27:28 | c    | 
 7 | 2018-01-01 00:29:29 | a    | 
 8 | 2018-01-01 00:30:35 | b    | 
 9 | 2018-01-01 00:31:16 | b    | 
10 | 2018-01-01 00:35:22 | 0    | 
...

In this way I will be able to recover at least some of the events without session code. I think that maybe the first thing to do is to compute two new columns showing the first and last time value for each session, something like that:

foo['last'] = foo.groupby('sess_id')['DATE'].transform(max) 
foo['firs'] = foo.groupby('SESSIONCODE')['DATE'].transform(min) 

And then use first-last time value to check whether each event whose session id is unknown falls withing that range.

EAMC
  • 45
  • 5

1 Answers1

0

Your intuition seems fine by me, but you can't apply it this way since your dataframe foo doens't have the same size as your groupby dataframe. What you could do is map the values like this:

foo['last'] = foo.sess_id.map(foo.groupby('sess_id').DATE.max())
foo['first'] = foo.sess_id.map(foo.groupby('sess_id').DATE.min())

But I don't think it's necessary, you can just use the groupby dataframe as such.

A way to solve your problem could be to look for the missing values in sess_id column, and apply a custom function to the corresponding dates:

def my_custom_function(time):
    current_sessions = my_agg.loc[(my_agg['min']<time) & (my_agg['max']>time)]
    count = len(current_sessions)
    if count == 0:
        return 0
    if count > 1:
        return -99
    return current_sessions.index[0]

my_agg = foo.groupby('sess_id').DATE.agg([min,max])
foo.loc[foo.sess_id.isnull(),'sess_id'] = foo.loc[foo.sess_id.isnull(),'DATE'].apply(my_custom_function)

Output:

    DATE                    sess_id
0   2018-01-01 00:19:01     a
1   2018-01-01 00:19:05     b
2   2018-01-01 00:21:07     a
3   2018-01-01 00:22:07     b
4   2018-01-01 00:25:09     c
5   2018-01-01 00:25:11     -99
6   2018-01-01 00:27:28     c
7   2018-01-01 00:29:29     a
8   2018-01-01 00:30:35     b
9   2018-01-01 00:31:16     b
10  2018-01-01 00:35:22     0

I think it performs what you are looking for, though the output you posted in your question seems to contain typos.

ysearka
  • 3,805
  • 5
  • 20
  • 41
  • Thank you for your help! Indeed, there was a small typo in the output, i'm going to fix it. Could I just ask you to clarify what the last return (return current_sessions.index[0]) does? – EAMC Aug 23 '18 at 14:49
  • No problem, in case that there is exactly one session which match the test (ie which begins before the given time and ends after), we retrieve its name. Indeed the data frame `my_agg` is given by a groupby on `sess_id` which means its index is composed of values from `sess_id`. When we only have one session in its index, `current_sessions.index[0]` allows us to retrieve this `sess_id` value. Is it clear? – ysearka Aug 24 '18 at 07:39
  • in `my_custom_function` I have some if statements. These are here to check how many sessions are open at the time we're looking at. The first `if count == 0` activates when no session is open, the second `if count > 1` is for when you have multiple open sessions, so that when you reach the last `return` you must have only one open session. – ysearka Aug 24 '18 at 14:03
  • It's clear. So, let's say that we have more session in the index, as more than just one match the test, what could we do? Maybe change to if count ==1 return current_session.index? In any case thank you for the clear explanation, I really appreciated it. – EAMC Aug 24 '18 at 14:08
  • It depends on what your are trying to return, I understood from your question that if you couldn't link the time to a unique session, you should get `-99` as a result. Am I wrong? – ysearka Aug 24 '18 at 14:13
  • No, you were right. Thank you for your teaching, they really helped me much. – EAMC Aug 28 '18 at 09:43