
I have a system log file for which I would like to count, for each event, how often each other event occurs directly after it. The events are listed in long format: each event is recorded as a row with a timestamp. So, in order to count subsequent events, I have to look at the row that follows each row, for each user.

My data frame looks like this (in this example, the event would be new and the subsequent event to be counted would be analysis; I hope you get the gist):

    created_at_date      id   event     username
0   2017-11-13 09:17:59  1    new       user1
1   2017-11-13 09:18:00  1    analysis  user1
2   2017-11-13 09:18:20  1    feedback  user1
3   2017-11-13 09:26:58  NaN  id_new    user2

The desired output is a matrix counting, for each event, how often each subsequent event occurs after it:

                               Event
             login  id_new  new   analysis   feedback   check   edit
subsequent event                                
login        0      0       1     0         0           2      0
id_new       0      0       1     0         0           0      0 
new          0      0       0     3         0           0      0
analysis     0      0       1     1         3           0      1 
feedback     1      0       1     0         0           2      0 
check        0      0       0     0         0           0      0 
edit         0      0       0     0         0           0      0 

As such, I first created an empty matrix:

import numpy as np
import pandas as pd

a = np.zeros(shape=(7, 7))
names = ['new', 'analysis', 'feedback', 'id_new', 'login', 'check', 'edit']
df_empty = pd.DataFrame(a, index=names, columns=names)

and then tried to fill the matrix with a solution adapted from here. However, just iterating over the events, I get the error `ValueError: cannot label index with a null key`:

from itertools import tee, islice, chain, izip

def previous_and_next(some_iterable):
    # pair every element with the one that follows it; the last element is paired with None
    items, nexts = tee(some_iterable, 2)
    nexts = chain(islice(nexts, 1, None), [None])
    return izip(items, nexts)

for item, nxt in previous_and_next(df['event']):
    df_empty.loc[item, nxt] = df_empty.loc[item, nxt] + 1
df_empty

Any idea where this is coming from?
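
To make the failing piece easier to inspect in isolation, here is a minimal sketch of what previous_and_next yields on a short list (assuming Python 3, where izip has to be replaced by the built-in zip; on Python 2 izip produces the same pairs). Note that the final pair carries None as the "next" event, which is presumably the null key that .loc complains about:

from itertools import tee, islice, chain

def previous_and_next(some_iterable):
    # pair every event with the one that follows it; the last event is paired with None
    items, nexts = tee(some_iterable, 2)
    nexts = chain(islice(nexts, 1, None), [None])
    return zip(items, nexts)

print(list(previous_and_next(['new', 'analysis', 'feedback'])))
# [('new', 'analysis'), ('analysis', 'feedback'), ('feedback', None)]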

A reproducible example:

import pandas as pd

def data():
    data = {'created_at_date': pd.Series(['2017-11-13 09:17:59', '2017-11-13 09:18:00', '2017-11-13 09:18:20', 
                                      '2017-11-13 09:26:58', '2017-11-13 09:28:23', '2017-11-13 09:28:24', 
                                      '2017-11-13 09:40:16', '2017-11-13 09:45:37', '2017-11-13 09:46:03',
                                      '2017-11-13 09:46:38', '2017-11-13 09:46:39', '2017-11-13 09:46:48',
                                      '2017-11-13 09:58:19', '2017-11-13 10:04:44', '2017-11-13 10:04:56',
                                     '2018-04-13 10:02:36',  '2018-04-13 10:02:37', '2018-04-13 10:02:41',
                                     '2018-04-13 10:54:22', '2017-11-17 11:15:03', '2017-11-17 11:15:40',
                                     '2018-04-13 13:38:33']),
        'username': pd.Series(['user1', 'user1', 'user1', 
                               'user2', 'user2', 'user2', 
                               'None', 'None', 'user3',
                               'user3', 'user3', 'user3',
                               'None', 'user3', 'user3',
                               'user4', 'user4', 'user4',
                               'None', 'user1','user1', 'None']),
        'id': pd.Series([1, 1, 1,
                         'NaN', 2, 2,
                         'NaN', 'NaN', 'NaN',
                         3, 3, 3,
                         'NaN', 'NaN', 'NaN',
                         4, 4, 'NaN',
                         'NaN', 5, 5,'NaN']),
       'event': pd.Series(['new', 'analysis', 'feedback', 
                          'id_new', 'new', 'analysis', 
                          'login', 'login', 'new',
                          'analysis', 'feedback', 'login',
                          'check', 'new', 'new', 
                          'analysis','edit', 'login', 
                          'check', 'analysis','feedback',
                          'login'])}
    df = pd.DataFrame(data)
    df['created_at_date'] = pd.to_datetime(df['created_at_date'])
    return df
df = data()

Additional: how would I iterate over the data grouped by user, so that sequences are only counted within each user?
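
A rough sketch of what I have in mind, reusing df_empty and previous_and_next from above (sorting by created_at_date before grouping is an assumption on my part, and I am not sure this is the idiomatic way):

# count transitions only within each user's own, time-ordered event sequence
for username, group in df.sort_values('created_at_date').groupby('username'):
    for item, nxt in previous_and_next(group['event']):
        if nxt is not None:  # skip the (last_event, None) pair at the end of each group
            df_empty.loc[item, nxt] = df_empty.loc[item, nxt] + 1
df_empty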

EDIT: @coldspeed's response offers an alternative, but it does not explain the error I received with my original approach, so any additional advice would be welcome!

mizzlosis
    Are you looking for `df.pivot_table(index='event', columns=df.event.shift(-1), aggfunc='size', fill_value=0)`? – cs95 May 12 '18 at 19:38
  • Thanks! I didn't know it was that easy with pivot tables. Are the columns or the row names the subsequent events? I have several restrictions that I would like to apply at a later stage, such as restricting the counting to sequences within individual users (e.g. so that in the snippet above, the id_new that follows feedback is not counted, because they belong to different users). I think the iterating approach would make that possible, but I am not sure whether this is manageable with pivot_table. Please let me know if it is! – mizzlosis May 12 '18 at 20:02
  • The columns are the subsequent events. By the way, if you want to apply those restrictions, you will need to do it beforehand. Pivot is dumb, it just does the counting. – cs95 May 12 '18 at 20:10
  • @cᴏʟᴅsᴘᴇᴇᴅ It's better to put it as an answer so OP can accept – Vivek Kalyanarangan May 13 '18 at 14:07

0 Answers