0

I have two pandas.dataframe objects, A and B. A contains a large amount of streamed data with associated timestamps. B contains events that occured within the stream, although those event times are written independently from the logtimes in A which means they aren't perfectly matching. I want to label all rows in A, based on the event log in B.

I think I have something that works but it feels very sloppy and I would like to know if there is a more efficient way. I'm also not entirely sure how accurate my solution is in the real dataset (because I can't search through thousands of rows to check). Here is my example data:

A
Out[1]: 
   logtime
0  1.23435
1  2.31235
2  3.32356
3  4.74532
4  5.34566
5  6.47245
6  7.93357
7  8.36548
8  9.87456

B
    Out[2]: 
      time_loc  event  event_time
    0    start      1         1.1
    1     stop      1         4.1
    2    start      2         4.2
    3     stop      2         7.5
    4    start      3         7.7
    5     stop      3         9.9

Example of desired output:      
    logtime     event
    1.23435     1
    2.31235     1
    3.32356     1
    4.74532     2
    5.34566     2
    6.47245     2
    7.93357     3
    8.36548     3
    9.87456     3

And here is my solution:

import pandas as pd

A = pd.DataFrame({
    'logtime':[1.23435,2.31235,3.32356,4.74532,5.34566,6.47245,7.93357,8.36548,9.87456]
    })

B = pd.DataFrame({
    'time_loc':['start', 'stop', 'start', 'stop', 'start', 'stop'],
    'event':[1,1,2,2,3,3],
    'event_time':[1.1,4.1,4.2,7.5,7.7,9.9]
    })


# A.loc[A['logtime'] > B.loc[B['time_loc']=='start']['event_time']] # gives error.

# make wide format
Bw = B.set_index(['time_loc', 'event']).unstack(0)
Bw.columns = [f'{i}_{j}' for i, j in Bw.columns]
Bw = Bw.reset_index()
# Create an events Series with bins...
events = (pd.cut(x=A['logtime'],
                bins=(Bw['event_time_start'] + Bw['event_time_stop'] - Bw['event_time_start']).values,
                labels=False)+1).fillna(0)

Aout = pd.concat([A, events.rename('event')+1], axis=1)
print(Aout)

Out[3]
   logtime  event
0  1.23435    1.0
1  2.31235    1.0
2  3.32356    1.0
3  4.74532    2.0
4  5.34566    2.0
5  6.47245    2.0
6  7.93357    3.0
7  8.36548    3.0
8  9.87456    3.0

So this seems to be correct from this example, but I really don't trust it because it feels hacky.

fffrost
  • 1,659
  • 1
  • 21
  • 36
  • 1
    you can try [`merge_asof`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html) : `pd.merge_asof(A,B,left_on='logtime',right_on='event_time').loc[:,['logtime','event']]` , by controlling the threshold parameter you can control the merge – anky Mar 11 '20 at 13:08
  • https://stackoverflow.com/a/44601120/7964527 – BENY Mar 11 '20 at 13:12
  • The answer in that question does not work because my two dataframes have different dimensions. – fffrost Mar 11 '20 at 13:22

0 Answers0