
I already asked a similar question (see here), but unfortunately it was not clear enough, so I decided it was better to create a new one with a better example dataset and a clearer explanation of the desired output - an edit would really have been a major change. So, I have the following dataset (it is already sorted by date and player):

import numpy as np
import pandas as pd

d = {'player': ['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '3', '3', '3', '3', '3', '3'],
     'date': ['2018-01-01 00:17:01', '2018-01-01 00:17:05', '2018-01-01 00:19:05', '2018-01-01 00:21:07', '2018-01-01 00:22:09',
              '2018-01-01 00:22:17', '2018-01-01 00:25:09', '2018-01-01 00:25:11', '2018-01-01 00:27:28', '2018-01-01 00:29:29',
              '2018-01-01 00:30:35', '2018-02-01 00:31:16', '2018-02-01 00:35:22', '2018-02-01 00:38:16',
              '2018-02-01 00:38:20', '2018-02-01 00:55:15', '2018-01-03 00:55:22',
              '2018-01-03 00:58:16', '2018-01-03 00:58:21', '2018-03-01 01:00:35', '2018-03-01 01:20:16', '2018-03-01 01:31:16'],
     'id': [np.nan, np.nan, 'a', 'a', 'b', np.nan, 'b', 'c', 'c', 'c', 'c', 'd', 'd', 'e', 'e', np.nan, 'f', 'f',
            'g', np.nan, 'f', 'g']}

#create dataframe
df = pd.DataFrame(data=d)
#change date to datetime
df['date'] = pd.to_datetime(df['date'])
df

   player      date         id
0   1   2018-01-01 00:17:01 NaN
1   1   2018-01-01 00:17:05 NaN
2   1   2018-01-01 00:19:05 a
3   1   2018-01-01 00:21:07 a
4   1   2018-01-01 00:22:09 b
5   1   2018-01-01 00:22:17 NaN
6   1   2018-01-01 00:25:09 b
7   1   2018-01-01 00:25:11 c
8   1   2018-01-01 00:27:28 c
9   1   2018-01-01 00:29:29 c
10  1   2018-01-01 00:30:35 c
11  2   2018-02-01 00:31:16 d
12  2   2018-02-01 00:35:22 d
13  2   2018-02-01 00:38:16 e
14  2   2018-02-01 00:38:20 e
15  2   2018-02-01 00:55:15 NaN
16  3   2018-01-03 00:55:22 f
17  3   2018-01-03 00:58:16 f
18  3   2018-01-03 00:58:21 g
19  3   2018-03-01 01:00:35 NaN
20  3   2018-03-01 01:20:16 f
21  3   2018-03-01 01:31:16 g

So, these are my three columns:

  1. 'player' (object): the player code.
  2. 'id' (object): the session id. Each session id groups together a set of actions (i.e. the rows of the dataset) that the player performed online.
  3. 'date' (datetime): the time at which each action was performed.

The problem with this dataset is that I have the timestamp of every action, but some actions are missing their session id. What I want to do is the following: for each player, I want to assign an id label to the missing values, based on the timeline. An action missing its id can be labeled if it falls within the temporal range (first action - last action) of a certain session.

Ok, so here I have my missing values:

df.loc[df.id.isnull(),'date']
0     2018-01-01 00:17:01
1     2018-01-01 00:17:05
5     2018-01-01 00:22:07
15    2018-02-01 00:55:15
19    2018-03-01 01:00:35

Please note that I have the player code for each one of them: what I am missing is just the session code. So, I want to compare the timestamp of each missing value with the session timestamps of the corresponding player. I was thinking of computing, with a groupby, the first and last action of each session for each player (but I do not know if it is the best approach):

my_agg = df.groupby(['player', 'id']).date.agg(['min', 'max'])
my_agg
                  min                      max
player  id      
1       a   2018-01-01 00:19:05   2018-01-01 00:21:07
        b   2018-01-01 00:22:09   2018-01-01 00:25:09
        c   2018-01-01 00:25:11   2018-01-01 00:30:35
2       d   2018-02-01 00:31:16   2018-02-01 00:35:22
        e   2018-02-01 00:38:16   2018-02-01 00:38:20
3       f   2018-01-03 00:55:22   2018-03-01 01:20:16
        g   2018-01-03 00:58:21   2018-03-01 01:31:16

Then I would like to match the NaNs by player, and compare the timestamp of each missing value with the range of each session of that player.

In the dataset I try to illustrate the three possible scenarios I am interested in (a quick check of two of these cases against my_agg is sketched after this list):

  1. The action occurred between the first and last date of a certain session. In this case I would fill the missing value with the id of that session, as it clearly belongs to it. Row 5 of the dataset should therefore be labeled 'b', as it falls within the range of b.
  2. The action occurred outside the range of every session: I would mark it with '0'. This is the case of the first two NaNs and of row 15.
  3. The action cannot be associated with a single session, because it occurred during the time range of several sessions: I would mark it with '-99'. This is the case of row 19, the last NaN.
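To make these cases concrete, here is a quick sketch of the matching logic against my_agg (I use inclusive bounds here, which makes no difference on this data):

t = pd.Timestamp('2018-01-01 00:22:17')               #row 5, player '1'
ranges = my_agg.loc['1']                              #session ranges of player '1'
ranges[(ranges['min'] <= t) & (t <= ranges['max'])]   #matches only session 'b'

t = pd.Timestamp('2018-03-01 01:00:35')               #row 19, player '3'
ranges = my_agg.loc['3']                              #session ranges of player '3'
ranges[(ranges['min'] <= t) & (t <= ranges['max'])]   #matches both 'f' and 'g' -> ambiguous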

Desired output: to sum it up, the outcome should look like this dataframe:

  player      date         id
0   1   2018-01-01 00:17:01 0
1   1   2018-01-01 00:17:05 0
2   1   2018-01-01 00:19:05 a
3   1   2018-01-01 00:21:07 a
4   1   2018-01-01 00:22:09 b
5   1   2018-01-01 00:22:17 b
6   1   2018-01-01 00:25:09 b
7   1   2018-01-01 00:25:11 c
8   1   2018-01-01 00:27:28 c
9   1   2018-01-01 00:29:29 c
10  1   2018-01-01 00:30:35 c
11  2   2018-02-01 00:31:16 d
12  2   2018-02-01 00:35:22 d
13  2   2018-02-01 00:38:16 e
14  2   2018-02-01 00:38:20 e
15  2   2018-02-01 00:55:15 0
16  3   2018-01-03 00:55:22 f
17  3   2018-01-03 00:58:16 f
18  3   2018-01-03 00:58:21 g
19  3   2018-03-01 01:00:35 -99
20  3   2018-03-01 01:20:16 f
21  3   2018-03-01 01:31:16 g
EAMC

2 Answers


This may not be the best approach but it does work. Basically I create some new columns using shift, and then apply the conditions you mentioned with np.select:

#store the previous and next id next to each row, then combine them into one column
df['shift'] = df['id'].shift(1)
df['shift-1'] = df['id'].shift(-1)
df['merge'] = df[['shift', 'shift-1']].values.tolist()
df.drop(columns=['shift', 'shift-1'], inplace=True)

#map each id (and NaN) to an integer so neighbouring ids can be compared
alpha = {np.nan: 0, 'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8}
diff = []
for i in range(len(df)):
    diff.append(alpha[df['merge'][i][1]] - alpha[df['merge'][i][0]])

df['diff'] = diff

conditions = [
    #NaN sandwiched between two equal ids of the same player -> fill with that id
    (df['id'].shift(1).eq(df['id'].shift(-1)) & df['id'].isna() &
     df['player'].shift(1).eq(df['player'].shift(-1))),

    #NaN between two different ids of the same player, non-decreasing ids -> outside any session
    (~df['id'].shift(1).eq(df['id'].shift(-1)) & df['id'].isna() &
     (df['player'].shift(1).eq(df['player']) | df['player'].shift(-1).eq(df['player'])) &
     (df['diff'] >= 0)),

    #NaN between two different ids of the same player, decreasing ids -> ambiguous
    (~df['id'].shift(1).eq(df['id'].shift(-1)) & df['id'].isna() &
     (df['player'].shift(1).eq(df['player']) | df['player'].shift(-1).eq(df['player'])) &
     (df['diff'] < 0)),
]
choices = [df['id'].ffill(),
           0,
           -99]
df['id'] = np.select(conditions, choices, default=df['id'])
df.drop(columns=['merge', 'diff'], inplace=True)
df

out:

    player  date              id
0   1   2018-01-01 00:17:01   0
1   1   2018-01-01 00:17:05   0
2   1   2018-01-01 00:19:05   a
3   1   2018-01-01 00:21:07   a
4   1   2018-01-01 00:22:09   b
5   1   2018-01-01 00:22:17   b
6   1   2018-01-01 00:25:09   b
7   1   2018-01-01 00:25:11   c
8   1   2018-01-01 00:27:28   c
9   1   2018-01-01 00:29:29   c
10  1   2018-01-01 00:30:35   c
11  2   2018-02-01 00:31:16   d
12  2   2018-02-01 00:35:22   d
13  2   2018-02-01 00:38:16   e
14  2   2018-02-01 00:38:20   e
15  2   2018-02-01 00:55:15   0
16  3   2018-01-03 00:55:22   f
17  3   2018-01-03 00:58:16   f
18  3   2018-01-03 00:58:21   g
19  3   2018-03-01 01:00:35  -99
20  3   2018-03-01 01:20:16   f
21  3   2018-03-01 01:31:16   g
It_is_Chris
  • Thank you @Chris for your help, but ffill is actually not what I was expecting. What I needed was to check whether every row with a missing session id falls within the range of one of the available sessions, and I need to do this comparison player by player. I think I found a solution (see my code below). – EAMC Sep 03 '18 at 09:37

In my solution I just had to work a bit to correctly apply the function written by @ysearka in a previous stackoverflow question - see here. The basic challenge was to apply his function player by player.

#define a function to sort out the missing values (ysearka's function from stackoverflow)
#note: it reads the my_agg table that is rebuilt for each player in the loop below
def my_custom_function(time):
    #compare the date of the event with the range of every session of the player
    current_sessions = my_agg.loc[(my_agg['min'] < time) & (my_agg['max'] > time)]
    #store the number of matching sessions
    count = len(current_sessions)
    #if 0, no match was found: the event lies outside all possible ranges
    if count == 0:
        return 0
    #if more than one, it is impossible to say to which session the event belongs
    if count > 1:
        return -99
    #otherwise (count == 1) the event clearly belongs to exactly one session:
    #return its id, i.e. the second level of the (player, id) index
    return current_sessions.index[0][1]
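Just to illustrate what the function returns, here is a hypothetical trace, assuming my_agg has been built for player '1' only (as happens inside the loop below):

#my_custom_function(pd.Timestamp('2018-01-01 00:22:17'))  -> 'b' (exactly one containing session)
#my_custom_function(pd.Timestamp('2018-01-01 00:17:01'))  -> 0   (no containing session)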


#create a list storing all the player ids
plist = list(df.player.unique())

#silence the SettingWithCopyWarning: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
pd.options.mode.chained_assignment = None

# create an empty new dataframe, where to store the results
final = pd.DataFrame()
#with this for loop iterate over the part of the dataset corresponding to one player at a time
for i in plist:
    #slice the dataset by player
    players = df.loc[df['player'] == i]
    #for every player, take the dates where the id is missing
    mv_per_player = players.loc[players.id.isnull(), 'date']
    #for each player, group by session id and compute the first and last event
    my_agg = players.groupby(['player', 'id']).date.agg(['min', 'max'])
    #apply the function to this chunk of the dataset: you obtain a series with the imputed values for the NaNs
    ema = mv_per_player.apply(my_custom_function)
    #now we can substitute the missing ids with the new imputed values...
    players.loc[players.id.isnull(), 'id'] = ema.values
    #append the new values stored in players to the new dataframe
    #(DataFrame.append is gone in recent pandas, so pd.concat is the safer choice)
    final = pd.concat([final, players])

#...and check the new dataset
final

player  date    id
0   1   2018-01-01 00:17:01 0
1   1   2018-01-01 00:17:05 0
2   1   2018-01-01 00:19:05 a
3   1   2018-01-01 00:21:07 a
4   1   2018-01-01 00:22:09 b
5   1   2018-01-01 00:22:17 b
6   1   2018-01-01 00:25:09 b
7   1   2018-01-01 00:25:11 c
8   1   2018-01-01 00:27:28 c
9   1   2018-01-01 00:29:29 c
10  1   2018-01-01 00:30:35 c
11  2   2018-02-01 00:31:16 d
12  2   2018-02-01 00:35:22 d
13  2   2018-02-01 00:38:16 e
14  2   2018-02-01 00:38:20 e
15  2   2018-02-01 00:55:15 0
16  3   2018-01-03 00:55:22 f
17  3   2018-01-03 00:58:16 f
18  3   2018-01-03 00:58:21 g
19  3   2018-03-01 01:00:35 -99
20  3   2018-03-01 01:20:16 f
21  3   2018-03-01 01:31:16 g

I do not think my solution is the best one, and I would still appreciate other ideas, especially ones that scale more easily (I have a large dataset).
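For what it's worth, one possibly more scalable direction (just a sketch under the same assumptions as above - df as defined in the question, inclusive range bounds - not something I have benchmarked) is to drop the per-player Python loop and match all missing rows against all session ranges with a single merge:

#first/last event per (player, session), as a flat table
bounds = (df.dropna(subset=['id'])
            .groupby(['player', 'id'])['date'].agg(['min', 'max'])
            .reset_index())

#rows with a missing id, keeping their original row labels in the 'index' column
missing = df.loc[df['id'].isnull(), ['player', 'date']].reset_index()

#pair every missing row with every session of the same player, keep containing ranges
cand = missing.merge(bounds, on='player')
hit = cand[(cand['date'] >= cand['min']) & (cand['date'] <= cand['max'])]

#count the distinct matching sessions per missing row
counts = hit.groupby('index')['id'].agg(['nunique', 'first'])

fill = pd.Series(0, index=missing['index'], dtype=object)   #default: no match -> 0
fill.loc[counts.index[counts['nunique'] > 1]] = -99         #several matches -> -99
unique = counts.index[counts['nunique'] == 1]
fill.loc[unique] = counts.loc[unique, 'first']              #exactly one match -> its id

df.loc[fill.index, 'id'] = fill                             #impute in place

The cross-join is per player, so it can still grow if a single player has very many sessions and missing rows; in that case an interval-based structure (e.g. pd.IntervalIndex) might be worth a look.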

EAMC