I have a system log file for which I would like to count, for each event, how often each other event immediately follows it. The events are stored in long format: each event is recorded as a row with a timestamp. So, in order to count subsequent events, I have to look at the row that follows each row, per user.
My data frame looks like this (in this example the event would be new and the subsequent event to be counted would be analysis; hope you get the gist):
created_at_date id event username
0 2017-11-13 09:17:59 1 new user1
1 2017-11-13 09:18:00 1 analysis user1
2 2017-11-13 09:18:20 1 feedback user1
3 2017-11-13 09:26:58 NaN id_new user2
The desired output is a matrix, counting the subsequent occurrences after each event.
Event
login id_new new analysis feedback check edit
subsequent event
login 0 0 1 0 0 2 0
id_new 0 0 1 0 0 0 0
new 0 0 0 3 0 0 0
analysis 0 0 1 1 3 0 1
feedback 1 0 1 0 0 2 0
check 0 0 0 0 0 0 0
edit 0 0 0 0 0 0 0
As such, I first created an empty matrix:
import numpy as np
import pandas as pd

a = np.zeros(shape=(7, 7))
names = ['new', 'analysis', 'feedback', 'id_new', 'login', 'check', 'edit']
df_empty = pd.DataFrame(a, index=names, columns=names)
and then tried to fill the matrix with a solution adapted from here. However, when iterating over the rows I get an error saying ValueError: cannot label index with a null key:
from itertools import tee, islice, chain, izip

def previous_and_next(some_iterable):
    items, nexts = tee(some_iterable, 2)
    nexts = chain(islice(nexts, 1, None), [None])
    return izip(items, nexts)
for item, nxt in previous_and_next(df['event']):
    df_empty.loc[item, nxt] = df_empty.loc[item, nxt] + 1
df_empty
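One thing I noticed while debugging (a minimal sketch run on Python 3, with izip swapped for the built-in zip): the helper pairs the last event with the None sentinel, so the final .loc lookup receives a null key.

```python
from itertools import tee, islice, chain

def previous_and_next(some_iterable):
    # same pairing helper as above, with izip -> zip for Python 3
    items, nexts = tee(some_iterable, 2)
    nexts = chain(islice(nexts, 1, None), [None])
    return zip(items, nexts)

pairs = list(previous_and_next(['new', 'analysis', 'feedback']))
print(pairs)
# the last pair is ('feedback', None)
```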
Any idea where this is coming from?
A reproducible example:
import pandas as pd

def data():
    data = {'created_at_date': pd.Series(['2017-11-13 09:17:59', '2017-11-13 09:18:00', '2017-11-13 09:18:20',
                                          '2017-11-13 09:26:58', '2017-11-13 09:28:23', '2017-11-13 09:28:24',
                                          '2017-11-13 09:40:16', '2017-11-13 09:45:37', '2017-11-13 09:46:03',
                                          '2017-11-13 09:46:38', '2017-11-13 09:46:39', '2017-11-13 09:46:48',
                                          '2017-11-13 09:58:19', '2017-11-13 10:04:44', '2017-11-13 10:04:56',
                                          '2018-04-13 10:02:36', '2018-04-13 10:02:37', '2018-04-13 10:02:41',
                                          '2018-04-13 10:54:22', '2017-11-17 11:15:03', '2017-11-17 11:15:40',
                                          '2018-04-13 13:38:33']),
            'username': pd.Series(['user1', 'user1', 'user1',
                                   'user2', 'user2', 'user2',
                                   'None', 'None', 'user3',
                                   'user3', 'user3', 'user3',
                                   'None', 'user3', 'user3',
                                   'user4', 'user4', 'user4',
                                   'None', 'user1', 'user1', 'None']),
            'id': pd.Series([1, 1, 1,
                             'NaN', 2, 2,
                             'NaN', 'NaN', 'NaN',
                             3, 3, 3,
                             'NaN', 'NaN', 'NaN',
                             4, 4, 'NaN',
                             'NaN', 5, 5, 'NaN']),
            'event': pd.Series(['new', 'analysis', 'feedback',
                                'id_new', 'new', 'analysis',
                                'login', 'login', 'new',
                                'analysis', 'feedback', 'login',
                                'check', 'new', 'new',
                                'analysis', 'edit', 'login',
                                'check', 'analysis', 'feedback',
                                'login'])}
    df = pd.DataFrame(data)
    df['created_at_date'] = pd.to_datetime(df['created_at_date'])
    return df

df = data()
Additional: how would I iterate over grouped data?
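To make that concrete, this is the kind of per-group iteration I have in mind (a sketch with a few made-up rows; groupby yields (name, sub-frame) pairs, so event pairs would never cross a user boundary):

```python
import pandas as pd

df = pd.DataFrame({
    'username': ['user1', 'user1', 'user1', 'user2', 'user2'],
    'event': ['new', 'analysis', 'feedback', 'id_new', 'new'],
})

# iterate per user, pairing each event with the one that follows it
for name, group in df.groupby('username'):
    events = list(group['event'])
    pairs = list(zip(events, events[1:]))
    print(name, pairs)
```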
EDIT: @coldspeed's response offers an alternative, but it does not explain the error I received with my method above. So any additional advice would be welcome!
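For context, as far as I understand it, the alternative goes roughly along these lines (a sketch using pd.crosstab against a shifted event column; not necessarily the exact code from the answer):

```python
import pandas as pd

events = pd.Series(['new', 'analysis', 'feedback', 'new', 'analysis'])

# cross-tabulate each event against the event that follows it;
# the trailing NaN produced by shift(-1) is excluded automatically
transitions = pd.crosstab(events, events.shift(-1))
print(transitions)
```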