
I have a data frame as:

Time    InvInstance
5          5
8          4
9          3
19         2
20         1
3          3
8          2
13         1

The Time variable is sorted, and the InvInstance variable denotes the number of rows remaining to the end of a Time block. I want to create another column showing whether a crossover condition is met within the Time column. I can do it with a for loop like this:

import pandas as pd
import numpy as np

df = pd.read_csv("test.csv")

df["10mMark"] = 0
for i in range(1,len(df)):
    r = int(df.InvInstance.iloc[i])
    rprev = int(df.InvInstance.iloc[i-1])
    m = df['Time'].iloc[i+r-1] - df['Time'].iloc[i]
    mprev = df['Time'].iloc[i-1+rprev-1] - df['Time'].iloc[i-1]
    df["10mMark"].iloc[i] = np.where((m < 10) & (mprev >= 10),1,0)
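If test.csv is not at hand, the example frame from the question can be built inline to reproduce this (a sketch; the values are copied from the table above):

```python
import pandas as pd

# build the example frame from the question directly, instead of reading test.csv
df = pd.DataFrame({
    "Time":        [5, 8, 9, 19, 20, 3, 8, 13],
    "InvInstance": [5, 4, 3, 2, 1, 3, 2, 1],
})
```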

And the desired output is:

Time  InvInstance  10mMark
 5            5        0
 8            4        0
 9            3        0
19            2        1
20            1        0
 3            3        0
 8            2        1
13            1        0

To be more specific: there are 2 sorted time blocks in the Time column, and going row by row we know the distance (in terms of rows) to the end of each block from the value of InvInstance. The question is whether the time difference between a row and the end of its block is less than 10 minutes, while it was greater than or equal to 10 in the previous row. Is it possible to do this without loops, e.g. with shift(), so that it runs much faster?

Gursel Karacor

2 Answers


I don't know of an internal vectorized Pandas/NumPy method for shifting a Series/array by a non-scalar (per-row) step, but we can use Numba here:

from numba import jit

@jit
def dyn_shift(s, step):
    assert len(s) == len(step), "[s] and [step] should have the same length"
    assert isinstance(s, np.ndarray), "[s] should have [numpy.ndarray] dtype"
    assert isinstance(step, np.ndarray), "[step] should have [numpy.ndarray] dtype"
    N = len(s)
    res = np.empty(N, dtype=s.dtype)
    for i in range(N):
        res[i] = s[i+step[i]-1]
    return res

mask1 = dyn_shift(df.Time.values, df.InvInstance.values) - df.Time < 10
mask2 = (dyn_shift(df.Time.values, df.InvInstance.values) - df.Time).shift() >= 10
df['10mMark'] = np.where(mask1 & mask2,1,0)

result:

In [6]: df
Out[6]:
   Time  InvInstance  10mMark
0     5            5        0
1     8            4        0
2     9            3        0
3    19            2        1
4    20            1        0
5     3            3        0
6     8            2        1
7    13            1        0
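For reference, the same dynamic shift can also be expressed with plain NumPy fancy indexing, with no compilation step; this is a sketch equivalent to `dyn_shift`, not part of the original answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Time":        [5, 8, 9, 19, 20, 3, 8, 13],
    "InvInstance": [5, 4, 3, 2, 1, 3, 2, 1],
})

# index of the last row of each block, computed per row
end_idx = np.arange(len(df)) + df["InvInstance"].to_numpy() - 1
# time remaining until the end of the block, same as dyn_shift(...) - Time
m = pd.Series(df["Time"].to_numpy()[end_idx] - df["Time"].to_numpy())
df["10mMark"] = np.where((m < 10) & (m.shift() >= 10), 1, 0)
```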

Timing for an 8,000-row DataFrame:

In [13]: df = pd.concat([df] * 10**3, ignore_index=True)

In [14]: df.shape
Out[14]: (8000, 3)

In [15]: %%timeit
    ...: df["10mMark"] = 0
    ...: for i in range(1,len(df)):
    ...:     r = int(df.InvInstance.iloc[i])
    ...:     rprev = int(df.InvInstance.iloc[i-1])
    ...:     m = df['Time'].iloc[i+r-1] - df['Time'].iloc[i]
    ...:     mprev = df['Time'].iloc[i-1+rprev-1] - df['Time'].iloc[i-1]
    ...:     df["10mMark"].iloc[i] = np.where((m < 10) & (mprev >= 10),1,0)
    ...:
3.06 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [16]: %%timeit
    ...: mask1 = dyn_shift(df.Time.values, df.InvInstance.values) - df.Time < 10
    ...: mask2 = (dyn_shift(df.Time.values, df.InvInstance.values) - df.Time).shift() >= 10
    ...: df['10mMark'] = np.where(mask1 & mask2,1,0)
    ...:
1.02 ms ± 21.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

speed-up factor:

In [17]: 3.06 * 1000 / 1.02
Out[17]: 3000.0
MaxU - stand with Ukraine
    Actually your answer is a good generic solution and could also be a solution for that question: https://stackoverflow.com/questions/45023685/pandas-shift-one-column-by-other-column-value Right? – Gursel Karacor Jun 25 '18 at 09:06

Actually, your m is the time delta between the time of a row and the time at the end of its 'block', and mprev is the same quantity at the previous row (so it is just a shift of m). My idea is to create a column containing the time at the end of the block: first identify each block, then use groupby on block to get the last time of each block and merge it back. From that, calculate the difference to create a column 'm', and finally use np.where and shift to fill the 10mMark column.

# a column with an incremental value at each block end
df['block'] = df.InvInstance[df.InvInstance == 1].cumsum()
# back-fill so every row of a block carries the same block number
df['block'] = df['block'].bfill()
# merge to create a column Time_last with the time at the end of the block
df = df.merge(df.groupby('block', as_index=False)['Time'].last(), on='block', suffixes=('', '_last'), how='left')
# create column m: time remaining until the end of the block
df['m'] = df['Time_last'] - df['Time']
# use np.where and shift on this column to create the 10mMark column
df['10mMark'] = np.where((df['m'] < 10) & (df['m'].shift() >= 10), 1, 0)
# drop the helper columns
df = df.drop(['block', 'Time_last', 'm'], axis=1)

The final result before dropping, to show what has been created, looks like:

   Time  InvInstance  block  Time_last   m  10mMark
0     5            5    1.0         20  15        0
1     8            4    1.0         20  12        0
2     9            3    1.0         20  11        0
3    19            2    1.0         20   1        1
4    20            1    1.0         20   0        0
5     3            3    2.0         13  10        0
6     8            2    2.0         13   5        1
7    13            1    2.0         13   0        0

in which the 10mMark column matches the expected result.

It is not as efficient as the Numba solution of @MaxU, but with the 8,000-row df he used, I get a speed-up factor of about 350 over the loop.
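A variation on the same idea (a sketch, not part of the original answer) avoids the merge and the helper columns by broadcasting the block-end time back onto each row with `groupby().transform('last')`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Time":        [5, 8, 9, 19, 20, 3, 8, 13],
    "InvInstance": [5, 4, 3, 2, 1, 3, 2, 1],
})

# rows where InvInstance == 1 end a block; shifting the cumulative count
# down one row gives a label that changes right after each block end
block = (df["InvInstance"] == 1).cumsum().shift(fill_value=0)
# time remaining until the end of the block
m = df.groupby(block)["Time"].transform("last") - df["Time"]
df["10mMark"] = np.where((m < 10) & (m.shift() >= 10), 1, 0)
```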

Ben.T