2

I have a dataframe containing a timestamp column. For every row, my objective is to find the first timestamp that is greater than that row's timestamp by a given offset (say 0.01 sec). I tried using the answer given here, in a modified way like:

import datetime as dt

def getTime(col, base, offset=0.01):
    # keep only timestamps more than `offset` seconds after `base`
    later = filter(lambda d: d > base + dt.timedelta(seconds=offset), col)
    return min(later)

testTbt['testTime'] = testTbt['exchTstamp'].apply(
    lambda time: getTime(
        testTbt[testTbt['exchTstamp'] > time]['exchTstamp'].iloc[0:100].tolist(),
        time, 0.01))

But got the error: ValueError: min() arg is an empty sequence.
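
The failure is easy to reproduce in isolation: when no element of `col` lies more than `offset` seconds after `base` (near the end of the data, or because only the next 100 rows are inspected), `filter` yields nothing and `min` raises:

import datetime as dt

base = dt.datetime(2019, 8, 14, 9, 15, 0)
# no element exceeds base + 0.01 s, so filter() yields an empty iterator
min(filter(lambda d: d > base + dt.timedelta(seconds=0.01), [base]))
# ValueError: min() arg is an empty sequence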

The timestamps have microsecond resolution, contain duplicate values, and are in increasing order.

Here are the initial rows of the dataframe:

    exchTstamp                  seqNum  prev10mSecAvg  prev1SecAvg  imbRegime
0   2019-08-14 09:15:00.022991  199      0.000000       0.000000    0
1   2019-08-14 09:15:00.022995  200     -0.166667      -0.166667    3
2   2019-08-14 09:15:00.022999  201     -0.277778      -0.277778    2
3   2019-08-14 09:15:00.023003  202     -0.333333      -0.333333    2
4   2019-08-14 09:15:00.023007  203     -0.386667      -0.386667    2

A csv of the data can be downloaded here.

The expected output is an added column containing, for each row, the first subsequent timestamp that is at least 0.01 sec greater than the row's own timestamp. E.g. for the given dataset, the row with time 09:15:00.022995 should get the value 09:15:00.033136, and likewise for every subsequent row.

A little guidance would be much appreciated.

Legolas
  • Can you add some sample data with expected output ([minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve))? – jezrael Aug 22 '19 at 07:24
  • @Legolas Should your example of `09:15:00.022995` not match up with `09:15:00.033136`? As it is the next value in `exchTstamp` that occurs 0.01 seconds after. – Ted Aug 22 '19 at 12:05

3 Answers

2

Ok, probably not the most elegant way to handle a problem like this, but this will get the job done:

import numpy as np
import pandas as pd

df = pd.read_csv("stack.csv", index_col=0)
df["exchTstamp"] = pd.to_datetime(df["exchTstamp"])  # vectorized parse instead of .apply

def getTime(base_idx, offset=0.01):
    # scan forward until a timestamp at least `offset` seconds later is found
    base = df["exchTstamp"][base_idx]
    for i in range(base_idx + 1, len(df.index)):
        if (df["exchTstamp"][i] - base).total_seconds() >= offset:
            return df["exchTstamp"][i]
    return np.nan  # reached the end: no later timestamp is far enough ahead

df["testTime"] = [getTime(j) for j in range(len(df.index))]

That then gives you:

df.head(10)
   exchTstamp                  seqNum  prev10mSecAvg  prev1SecAvg  imbRegime  testTime
0  2019-08-14 09:15:00.022991  199      0.000000       0.000000   0          2019-08-14 09:15:00.033136
1  2019-08-14 09:15:00.022995  200     -0.166667      -0.166667   3          2019-08-14 09:15:00.033136
2  2019-08-14 09:15:00.022999  201     -0.277778      -0.277778   2          2019-08-14 09:15:00.033136
3  2019-08-14 09:15:00.023003  202     -0.333333      -0.333333   2          2019-08-14 09:15:00.033136
4  2019-08-14 09:15:00.023007  203     -0.386667      -0.386667   2          2019-08-14 09:15:00.033136
5  2019-08-14 09:15:00.023011  204     -0.422222      -0.422222   0          2019-08-14 09:15:00.033136
6  2019-08-14 09:15:00.023015  205     -0.447619      -0.447619   0          2019-08-14 09:15:00.033136
7  2019-08-14 09:15:00.023018  206     -0.475000      -0.475000   0          2019-08-14 09:15:00.033136
8  2019-08-14 09:15:00.023023  207     -0.422222      -0.422222   1          2019-08-14 09:15:00.033136
9  2019-08-14 09:15:00.023027  208     -0.380000      -0.380000   3          2019-08-14 09:15:00.033136
Ted
  • Thanks, your code gives me correct results, but it takes nearly 50 mins to process the data I have (720k rows). I'll try optimizing it if there's any scope – Legolas Aug 23 '19 at 06:19
  • @Legolas No problem. Yes my answer was more of a "quick and dirty" approach, @Stef's answer using `searchsorted` was much more elegant. – Ted Aug 26 '19 at 06:05
2

I think the fastest and easiest way is using searchsorted. If there is no element greater than df.exchTstamp+delta, then searchsorted will return the length of the array, i.e. an index out of bounds of our dataframe index. Therefore we first need to insert a nan / NaT sentinel row for this case, which we remove afterwards:
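
To see the out-of-bounds case with plain floats (made-up values, not the question's data):

import numpy as np

a = np.array([0.0, 0.004, 0.011, 0.020])
np.searchsorted(a, a + 0.01)
# array([2, 3, 4, 4]) -- the 4s equal len(a), i.e. "no such element", hence the sentinel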

import numpy as np, pandas as pd, datetime as dt

df = pd.read_csv('stack.csv', index_col=0, parse_dates=[1])
delta = dt.timedelta(seconds=0.01)

res = np.searchsorted(df.exchTstamp, df.exchTstamp+delta)

# add sentinel (append returns a new frame, so reassign)
df = df.append(pd.Series(), ignore_index=True)

# out-of-bounds indices in res now point at the sentinel's NaT
testTime = df.loc[res, 'exchTstamp'].values

# remove sentinel, then attach the result column
df = df.drop(df.index[-1])
df['testTime'] = testTime

df.to_csv('stack-out.csv')
Stef
  • Takes about 0.01 s for the 1500 rows data set. – Stef Aug 23 '19 at 08:50
  • and 1.23 sec for my 720k rows data set. Amazing. Thanks! – Legolas Aug 24 '19 at 09:25
  • One thing to note for this answer: Make sure your data is properly indexed. In case you have taken the dataframe as a sliced copy from another dataframe, run `df.index=range(len(df))` – Legolas Sep 11 '19 at 13:15
1

`filter` produces an empty sequence for rows near the end of the dataframe, which is what caused your error. It is also wasteful to collect all the timestamps greater than the base one, since the data are in chronological order:

import pandas as pd, datetime as dt

df = pd.read_csv("stack.csv", parse_dates=[1], index_col=0)

l = []

for i in df.index:
    l.append(None)  # default: no qualifying later timestamp exists
    start = df.at[i, "exchTstamp"]
    # scan forward; stop at the first timestamp at least 0.01 s after `start`
    for k in range(i + 1, len(df.index)):
        if start <= df.at[k, "exchTstamp"] - dt.timedelta(seconds=0.01):
            l[-1] = df.at[k, "exchTstamp"]
            break

df["testTime"] = l
kantal
  • Thanks for the efforts, but I got the following error: `AttributeError: 'RangeIndex' object has no attribute 'stop'` – Legolas Aug 23 '19 at 06:13
  • After a couple of corrections, your code works amazingly fast, taking around 14 mins for the entire data set. – Legolas Aug 23 '19 at 07:20