2

I have the folowing dataset:

d = {'player': ['1', '1', '1', '1', '1', '1', '1', '1', '1', '2', '2', 
'2', '2', '2', '2', '3', '3', '3', '3', '3'],
'session': ['a', 'a', 'b', np.nan, 'b', 'c', 'c', 'c', 'c', 'd', 'd', 
'e', 'e', np.nan, 'e', 'f', 'f', 'g', np.nan,  'g'],
'date': ['2018-01-01 00:19:05', '2018-01-01 00:21:07', 
'2018-01-01 00:22:07', '2018-01-01 00:22:15','2018-01-01 00:25:09', 
'2018-01-01 00:25:11', '2018-01-01 00:27:28', '2018-01-01 00:29:29', 
'2018-01-01 00:30:35', '2018-01-01 00:21:16', '2018-01-01 00:35:22', 
'2018-01-01 00:38:16', '2018-01-01 00:38:20', '2018-01-01 00:40:35', 
'2018-01-01 01:31:16', '2018-01-03 00:55:22', '2018-01-03 00:58:16', 
'2018-01-03 00:58:21', '2018-03-01 01:00:35', '2018-03-01 01:31:16']
}

#create dataframe
df = pd.DataFrame(data=d)
#change date to datetime
df['date'] =  pd.to_datetime(df['date']) 

df.head()

     player session        date
0       1       a 2018-01-01 00:19:05
1       1       a 2018-01-01 00:21:07
2       1       b 2018-01-01 00:22:07
3       1     NaN 2018-01-01 00:22:15
4       1       b 2018-01-01 00:25:09

So, these are my three columns:

  1. 'player' - with three players (1,2,3) - dtype = object
  2. 'session' (object). Each session id groups together a set of actions (i.e. the rows in the dataset) that the players have implemented online.
  3. 'date' (datetime object) tells us the time at which each action was implemented.

The problem in this dataset is that I have the timestamps for each action, but some of the actions are missing their session id. What I want to do is the following: for each player, I want to give an id label for the missing values, based on the timeline. The actions missing their id can be labeled if they fall within the temporal range (first action - last action) of a certain session.

Let's say I groupby player & id, and compute the time range for each session:

my_agg = df.groupby(['player', 'session']).date.agg([min, max])
my_agg

                           min                 max
player session                                        
1      a       2018-01-01 00:19:05 2018-01-01 00:21:07
       b       2018-01-01 00:22:07 2018-01-01 00:25:09
       c       2018-01-01 00:25:11 2018-01-01 00:30:35
2      d       2018-01-01 00:21:16 2018-01-01 00:35:22
       e       2018-01-01 00:38:16 2018-01-01 01:31:16
3      f       2018-01-03 00:55:22 2018-01-03 00:58:16
       g       2018-01-03 00:58:21 2018-03-01 01:31:16

At this point I would like to iterate through every player, and to compare the timestamp of my nan values, session by session, to see where they belong.

Desired output: In the example, the first Nan should be labeled as 'b', the second one as 'e' and the last one as 'g'.

Disclaimer: I asked a similar question a few days ago (see here), and received a very good answer, but this time I must take into account another variable and I am again stuck. Indeed, the first steps in Python are exciting but very challenging.

EAMC
  • 45
  • 5
  • 3
    Does `df.fillna(method='ffill')` not satisfy your requirements? In other words, will `np.nan` not always be preceded by it's desired `session` value? – rahlf23 Aug 30 '18 at 19:43
  • 1
    Do you need a more general solution where perhaps the missing value straddles two sessions, and you need to choose the nearest one? – ALollz Aug 30 '18 at 19:51
  • @rahlf23 No, np.nan is not preceded by the session value, it is actually missing it. What I want is to assign a session value to the np.nan based on the timestamps and the player_id – EAMC Aug 30 '18 at 20:34
  • @ALollz well, in the question I posted and the example provided I did not mention it, as for each missing value there is just a corresponding session id. However, as you suggest, it would be nice to have a solution where if a np.nan falls within the range of multiple sessions (and thus it is impossible to assign with 100% certainty to one session) the np.nan is labeled as -99 or something like that, so that it we know that it was impossible to find its original session id. – EAMC Aug 30 '18 at 20:41

1 Answers1

0

Your example is already sorted, however this should produce your desired result even in the event that your inputs are not sorted. If this answer does not satisfy your requirements, please post an additional (or modified) sample dataframe with an expected output where this does violate your requirements.

df.sort_values(['player','date']).fillna(method='ffill')

Yields:

   player session                date
0       1       a 2018-01-01 00:19:05
1       1       a 2018-01-01 00:21:07
2       1       b 2018-01-01 00:22:07
3       1       b 2018-01-01 00:22:15
4       1       b 2018-01-01 00:25:09
5       1       c 2018-01-01 00:25:11
6       1       c 2018-01-01 00:27:28
7       1       c 2018-01-01 00:29:29
8       1       c 2018-01-01 00:30:35
9       2       d 2018-01-01 00:21:16
10      2       d 2018-01-01 00:35:22
11      2       e 2018-01-01 00:38:16
12      2       e 2018-01-01 00:38:20
13      2       e 2018-01-01 00:40:35
14      2       e 2018-01-01 01:31:16
15      3       f 2018-01-03 00:55:22
16      3       f 2018-01-03 00:58:16
17      3       g 2018-01-03 00:58:21
18      3       g 2018-03-01 01:00:35
19      3       g 2018-03-01 01:31:16
rahlf23
  • 8,869
  • 4
  • 24
  • 54
  • I'll accept your answer because it produces the expected result. Thank you for teaching me 'ffill', which I did not know. However, this method is not I was looking for. I wrote an additional question to clarify the issue ('Pandas: filling missing values by the time occurrence of an event'), as the revisions I had to write were extensive, and the desired output different. – EAMC Aug 31 '18 at 10:59
  • You bet, feel free to link your new question here so both I and others can follow up if interested. – rahlf23 Aug 31 '18 at 14:11