
I would like to fill two dataframe columns with the time difference between the current row's timestamp and the nearest following timestamp of "type A" or "not type A", respectively, i.e. type_A = 1 or type_A = 0. The following shows a small example:

import numpy as np
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'id':[1,2,3,4], 
                   'tmstmp':[datetime(2018,5,4,13,27,10), datetime(2018,5,3,13,27,10),
                             datetime(2018,5,2,13,27,10), datetime(2018,5,1,13,27,10)], 
                   'type_A':[0, 1, 0, 1],
                   'dt_A': [np.nan]*4,
                   'dt_notA': [np.nan]*4
                  })

(A and non-A rows do not necessarily alternate, but the timestamp column is already sorted in descending order.) I calculate the time difference between the timestamp in the current row and the next row with type_A = 1 or type_A = 0, respectively, by iterating over the integer row index and accessing elements by this integer index and the column name:

keys = {1: 'dt_A', 0: 'dt_notA'}
ridx = 0
while ridx + 1 < df.shape[0]:
    # time difference to the row immediately below, whatever its type
    ts1 = df.iloc[ridx]['tmstmp']
    ts2 = df.iloc[ridx + 1]['tmstmp']
    found = 0 if df.iloc[ridx + 1]['type_A'] == 0 else 1
    key = keys[found]
    df.loc[ridx, key] = (ts1 - ts2).total_seconds()/3600
    # scan further down for the first row of the complementary type
    complement = 1 - found
    j = 2
    while ridx + j < df.shape[0] and df.iloc[ridx + j]['type_A'] != complement:
        j += 1
    if ridx + j < df.shape[0]:
        ts1 = df.iloc[ridx]['tmstmp']
        ts2 = df.iloc[ridx + j]['tmstmp']
        val = (ts1 - ts2).total_seconds()/3600
    else:
        val = np.nan  # no row of the complementary type below this one
    df.loc[ridx, keys[complement]] = val
    ridx += 1

Iteration over a dataframe is discouraged for efficiency reasons (see How to iterate over rows in a DataFrame in Pandas?), and using integer indices is even less "pythonic", so my question is: in this particular case, is there a better (more efficient, more pythonic) way to iterate over the dataframe to achieve the given task? Many thanks for any suggestions or thoughts!

Edit: here are the input and output dataframes for the small example. The column dt_A contains the time delta between the current row and the nearest row below it that has type_A = 1; dt_notA contains the time delta to the nearest row below with type_A = 0.

input: 
   id              tmstmp  type_A  dt_A  dt_notA
0   1 2018-05-04 13:27:10       0   NaN      NaN
1   2 2018-05-03 13:27:10       1   NaN      NaN
2   3 2018-05-02 13:27:10       0   NaN      NaN
3   4 2018-05-01 13:27:10       1   NaN      NaN

output:

   id              tmstmp  type_A  dt_A  dt_notA
0   1 2018-05-04 13:27:10       0  24.0     48.0
1   2 2018-05-03 13:27:10       1  48.0     24.0
2   3 2018-05-02 13:27:10       0  24.0      NaN
3   4 2018-05-01 13:27:10       1   NaN      NaN
ctenar
  • it would be better if you could post the expected dataframe for validation (and a little more explanation of the logic - *maybe a loop isn't required*), but not sure – anky Jan 24 '20 at 16:25
  • can you explain what you're trying to do in that while loop? – Kenan Jan 24 '20 at 16:56
  • @Kenan: find the first index (counting from the current one) that has the desired type, i.e. type_A = 0 or type_A = 1, the complement of the type already found in the row immediately following the current one. Edit: I assumed you mean the inner one – ctenar Jan 24 '20 at 16:59

1 Answer

def next_value_index(l, i, val):
    """Return index of l where val occurs next from position i."""
    try:
        return l[(i+1):].index(val) + (i + 1)
    except ValueError:
        return np.nan

def next_value_indexes(l, val):
    """Return for each position in l next-occurrence-indexes of val in l."""
    return np.array([next_value_index(l, i, val) for i, _ in enumerate(l)])

def nan_allowing_access(df, col, indexes):
    """Return df[col] indexed by indexes. A np.nan index would cause errors,
    so this function returns np.nan where the index is np.nan."""
    # cast to int so that .iloc accepts the positions; 0 is only a placeholder
    # for the NaN entries, which are masked out again below
    idxs = np.array([int(idx) if not np.isnan(idx) else 0 for idx in indexes])
    res = df[col].iloc[idxs]
    res[np.isnan(indexes)] = np.nan
    return res  # NaT for timestamp columns

def diff_timestamps(dfcol1, dfcol2): # timestamp columns of pandas subtraction
    return [x - y for x, y in zip(list(dfcol1), list(dfcol2))]
    # this is not optimal in speed, but numpy did unwanted type conversions
    # problem is: np.array(df[tmstmp_col]) converts to `dtype='datetime64[ns]'`

def td2hours(timedelta): # convert timedelta to hours
    return timedelta.total_seconds() / 3600

def time_diff_to_next_val(df, tmstmp_col, col, val, converter_func, flip_subtraction=False):
    """
    Return, for every row of the pandas data frame `df`, the time difference
    between the row's timestamp (taken from column `tmstmp_col`) and the
    timestamp of the next row below it whose column `col` equals `val`.

    converter_func is the function used to convert each timedelta value.
    flip_subtraction determines the order of subtraction: whether the current
             row's timestamp is taken first or second when subtracting.
    """
    next_val_indexes = next_value_indexes(df[col].tolist(), val)
    next_val_timestamps = nan_allowing_access(df, tmstmp_col, next_val_indexes)
    return [converter_func(x) for x in diff_timestamps(*(df[tmstmp_col], next_val_timestamps)[::(1-2*flip_subtraction)])]
    # `*(df[tmstmp_col], next_val_timestamps)[::(1-2*flip_subtraction)]`
    # flips the order of arguments when `flip_subtraction = True`
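To see what the intermediate lookup produces, here is a quick check of next_value_indexes on the example's type_A column (a sketch using the plain list [0, 1, 0, 1] from the question's frame and the functions defined above):

next_value_indexes([0, 1, 0, 1], 1)  # array([ 1.,  3.,  3., nan]) - position of the next row with type_A = 1
next_value_indexes([0, 1, 0, 1], 0)  # array([ 2.,  2., nan, nan]) - position of the next row with type_A = 0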

Apply the functions by:

df['dt_A'] = time_diff_to_next_val(df, 'tmstmp', 'type_A', 1, converter_func=td2hours)
df['dt_notA'] = time_diff_to_next_val(df, 'tmstmp', 'type_A', 0, converter_func=td2hours)

Then df becomes:

   id              tmstmp  type_A  dt_A  dt_notA
0   1 2018-05-04 13:27:10       0  24.0     48.0
1   2 2018-05-03 13:27:10       1  48.0     24.0
2   3 2018-05-02 13:27:10       0  24.0      NaN
3   4 2018-05-01 13:27:10       1   NaN      NaN
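
A more compact, vectorized sketch of the same lookup (an alternative approach, not the helper functions above) can use Series.where, shift(-1) and bfill; it relies on tmstmp being sorted in descending order, as stated in the question:

# for each row, find the timestamp of the nearest row below it with the
# requested type_A value, then take the difference in hours
for val, colname in [(1, 'dt_A'), (0, 'dt_notA')]:
    # keep only timestamps of rows where type_A == val, look one row down,
    # and back-fill so every row sees the nearest such timestamp below it
    nxt = df['tmstmp'].where(df['type_A'].eq(val)).shift(-1).bfill()
    df[colname] = (df['tmstmp'] - nxt).dt.total_seconds() / 3600

On the example frame this yields the same dt_A and dt_notA values as shown above; rows with no matching row below them end up as NaN, because the subtraction produces NaT.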

Gwang-Jin Kim
    Many thanks for your detailed solution, I'll walk through it in detail! Regarding the column names you are right, I accidentally swapped the labels in the key dictionary and I have corrected this in my original post. – ctenar Jan 25 '20 at 11:06