I have two pandas.dataframe
objects, A
and B
. A
contains a large amount of streamed data with associated timestamps. B
contains events that occured within the stream, although those event times are written independently from the logtimes in A
which means they aren't perfectly matching. I want to label all rows in A
, based on the event log in B
.
I think I have something that works but it feels very sloppy and I would like to know if there is a more efficient way. I'm also not entirely sure how accurate my solution is in the real dataset (because I can't search through thousands of rows to check). Here is my example data:
A
Out[1]:
logtime
0 1.23435
1 2.31235
2 3.32356
3 4.74532
4 5.34566
5 6.47245
6 7.93357
7 8.36548
8 9.87456
B
Out[2]:
time_loc event event_time
0 start 1 1.1
1 stop 1 4.1
2 start 2 4.2
3 stop 2 7.5
4 start 3 7.7
5 stop 3 9.9
Example of desired output:
logtime event
1.23435 1
2.31235 1
3.32356 1
4.74532 2
5.34566 2
6.47245 2
7.93357 3
8.36548 3
9.87456 3
And here is my solution:
import pandas as pd
A = pd.DataFrame({
'logtime':[1.23435,2.31235,3.32356,4.74532,5.34566,6.47245,7.93357,8.36548,9.87456]
})
B = pd.DataFrame({
'time_loc':['start', 'stop', 'start', 'stop', 'start', 'stop'],
'event':[1,1,2,2,3,3],
'event_time':[1.1,4.1,4.2,7.5,7.7,9.9]
})
# A.loc[A['logtime'] > B.loc[B['time_loc']=='start']['event_time']] # gives error.
# make wide format
Bw = B.set_index(['time_loc', 'event']).unstack(0)
Bw.columns = [f'{i}_{j}' for i, j in Bw.columns]
Bw = Bw.reset_index()
# Create an events Series with bins...
events = (pd.cut(x=A['logtime'],
bins=(Bw['event_time_start'] + Bw['event_time_stop'] - Bw['event_time_start']).values,
labels=False)+1).fillna(0)
Aout = pd.concat([A, events.rename('event')+1], axis=1)
print(Aout)
Out[3]
logtime event
0 1.23435 1.0
1 2.31235 1.0
2 3.32356 1.0
3 4.74532 2.0
4 5.34566 2.0
5 6.47245 2.0
6 7.93357 3.0
7 8.36548 3.0
8 9.87456 3.0
So this seems to be correct from this example, but I really don't trust it because it feels hacky.