I would like to fill two DataFrame columns with the time difference between the current row's timestamp and the closest following timestamp of "type A" (type_A = 1) or "not type A" (type_A = 0), respectively. The following shows a small example:
    import numpy as np
    import pandas as pd
    from datetime import datetime

    df = pd.DataFrame({'id': [1, 2, 3, 4],
                       'tmstmp': [datetime(2018, 5, 4, 13, 27, 10), datetime(2018, 5, 3, 13, 27, 10),
                                  datetime(2018, 5, 2, 13, 27, 10), datetime(2018, 5, 1, 13, 27, 10)],
                       'type_A': [0, 1, 0, 1],
                       'dt_A': [np.nan]*4,
                       'dt_notA': [np.nan]*4
                       })
(A and non-A rows do not necessarily alternate, but the timestamp column is already sorted in descending order.) I calculate the time difference between the timestamp in the current row and the next row with type_A = 1 or type_A = 0, respectively, by iterating over the integer row index and accessing elements by this integer index and the column name:
    keys = {1: 'dt_A', 0: 'dt_notA'}
    ridx = 0
    while ridx + 1 < df.shape[0]:
        # The directly following row is the closest row of its own type
        ts1 = df.iloc[ridx]['tmstmp']
        ts2 = df.iloc[ridx + 1]['tmstmp']
        found = 0 if df.iloc[ridx + 1]['type_A'] == 0 else 1
        key = keys[found]
        df.loc[ridx, key] = (ts1 - ts2).total_seconds()/3600
        # Search further down for the first row of the other type
        complement = 1 - found
        j = 2
        while ridx + j < df.shape[0] and df.iloc[ridx + j]['type_A'] != complement:
            j += 1
        if ridx + j < df.shape[0]:
            ts1 = df.iloc[ridx]['tmstmp']
            ts2 = df.iloc[ridx + j]['tmstmp']
            val = (ts1 - ts2).total_seconds()/3600
        else:
            # No row of the other type follows
            val = np.nan
        df.loc[ridx, keys[complement]] = val
        ridx += 1
Iterating over a DataFrame is discouraged for efficiency reasons (see "How to iterate over rows in a DataFrame in Pandas?"), and using integer indices is even less "pythonic". So my question is: in this particular case, is there a "better" (more efficient, more pythonic) way to achieve the given task? Many thanks for any suggestions or thoughts!
Edit: here are the input and output dataframes for the small example. The column dt_A contains the time delta (in hours) between the current row and the next row that has type_A = 1; dt_notA contains the time delta to the next row that has type_A = 0.
input:
       id              tmstmp  type_A  dt_A  dt_notA
    0   1 2018-05-04 13:27:10       0   NaN      NaN
    1   2 2018-05-03 13:27:10       1   NaN      NaN
    2   3 2018-05-02 13:27:10       0   NaN      NaN
    3   4 2018-05-01 13:27:10       1   NaN      NaN
output:
       id              tmstmp  type_A  dt_A  dt_notA
    0   1 2018-05-04 13:27:10       0  24.0     48.0
    1   2 2018-05-03 13:27:10       1  48.0     24.0
    2   3 2018-05-02 13:27:10       0  24.0      NaN
    3   4 2018-05-01 13:27:10       1   NaN      NaN
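
For comparison, here is a minimal sketch of one possible loop-free way to produce the same output. It is not necessarily the best approach, and it assumes the rows are already sorted by tmstmp in descending order, as in the example. The helper name hours_to_next is hypothetical, introduced only for illustration. The idea is to blank out the timestamps of non-matching rows, shift the column up by one row so that a row never matches itself, and back-fill so that each row sees the timestamp of the next matching row below it.

```python
import numpy as np
import pandas as pd
from datetime import datetime


def hours_to_next(df, flag):
    """Hours between each row and the next row below it with type_A == flag.

    Hypothetical helper; assumes df is sorted by 'tmstmp' in descending order.
    """
    # Keep timestamps only where type_A matches; other rows become NaT.
    candidates = df['tmstmp'].where(df['type_A'] == flag)
    # Shift up by one row (exclude the current row), then back-fill so every
    # row carries the timestamp of the first matching row that follows it.
    next_match = candidates.shift(-1).bfill()
    # Rows with no matching successor stay NaT and therefore yield NaN.
    return (df['tmstmp'] - next_match).dt.total_seconds() / 3600


df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'tmstmp': [datetime(2018, 5, 4, 13, 27, 10), datetime(2018, 5, 3, 13, 27, 10),
                              datetime(2018, 5, 2, 13, 27, 10), datetime(2018, 5, 1, 13, 27, 10)],
                   'type_A': [0, 1, 0, 1]})

df['dt_A'] = hours_to_next(df, 1)
df['dt_notA'] = hours_to_next(df, 0)
print(df)
```

On the four-row example this gives dt_A = 24, 48, 24, NaN and dt_notA = 48, 24, NaN, NaN, which matches the output table above.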