Python Pandas - Minutes since last occurrence in 2 million row dataframe

Question

As an example, I have the following dataframe:

Date                 indicator_1    indicator_2
2013-04-01 03:50:00       x             w
2013-04-01 04:00:00       y             u
2013-04-01 04:15:00       z             v
2013-04-01 04:25:00       x             w 
2013-04-01 04:25:00       z             u
2013-04-01 04:30:00       y             u
2013-04-01 04:35:00       y             w
2013-04-01 04:40:00       z             w
2013-04-01 04:40:00       x             u
2013-04-01 04:40:00       y             v
2013-04-01 04:50:00       x             w

My goal is to create two columns with the following rules:

The first column should give the number of minutes since the last occurrence of 'x' in the indicator_1 column.
The second column should give the number of minutes since the last occurrence of the pair 'y' in indicator_1 and 'u' in indicator_2.

For rows that share the exact same timestamp, where one of them contains 'x' (first case) or the pair 'y','u' (second case), the number of minutes should be calculated with respect to the previous occurrence, not the one at that same timestamp. The desired output is therefore:

 Date               desired_column_1   desired_column_2  indicator_1 indicator_2
2013-04-01 03:50:00         NaN                NaN          x              w
2013-04-01 04:00:00         10.0               NaN          y              u
2013-04-01 04:15:00         25.0               15.0         z              v
2013-04-01 04:25:00         35.0               25.0         x              w
2013-04-01 04:25:00         35.0               25.0         z              u
2013-04-01 04:30:00          5.0               30.0         y              u
2013-04-01 04:35:00         10.0                5.0         y              w
2013-04-01 04:40:00         15.0               10.0         z              w
2013-04-01 04:40:00         15.0               10.0         x              u
2013-04-01 04:40:00         15.0               10.0         y              v
2013-04-01 04:50:00         10.0               20.0         x              w
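
The tie rule described above can be pinned down with a deliberately slow reference loop. This is only a sketch for checking small samples, not something to run on 2 million rows; the function name minutes_since_last and its arguments are made up for illustration, and it assumes the rows are already sorted by Date.

```python
import numpy as np
import pandas as pd


def minutes_since_last(dates, is_event):
    """Slow reference: minutes since the latest event with a strictly earlier timestamp."""
    out = []
    for i in range(len(dates)):
        # Only events stamped strictly before row i count, so same-timestamp ties are skipped.
        earlier = [dates[j] for j in range(i) if is_event[j] and dates[j] < dates[i]]
        if earlier:
            out.append((dates[i] - max(earlier)).total_seconds() / 60)
        else:
            out.append(np.nan)
    return out


dates = list(pd.to_datetime([
    "2013-04-01 03:50:00", "2013-04-01 04:25:00",
    "2013-04-01 04:25:00", "2013-04-01 04:30:00",
]))
is_x = [True, True, False, False]
result = minutes_since_last(dates, is_x)
print(result)  # [nan, 35.0, 35.0, 5.0]
```

Both 04:25 rows are measured against the 'x' at 03:50, while the 04:30 row is measured against the 'x' at 04:25.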

The main problem is that the full dataframe has more than 2 million rows, so a Python loop is too slow. Is there a vectorized approach to this problem?

The python code of the dataframe is the following:

import numpy as np
import pandas as pd

d = {'Date': ['2013-04-01 03:50:00', '2013-04-01 04:00:00', '2013-04-01 04:15:00',
              '2013-04-01 04:25:00', '2013-04-01 04:25:00', '2013-04-01 04:30:00',
              '2013-04-01 04:35:00', '2013-04-01 04:40:00', '2013-04-01 04:40:00',
              '2013-04-01 04:40:00', '2013-04-01 04:50:00'],
     'indicator_1': ['x', 'y', 'z', 'x', 'z', 'y', 'y', 'z', 'x', 'y', 'x'],
     'indicator_2': ['w', 'u', 'v', 'w', 'u', 'u', 'w', 'w', 'u', 'v', 'w'],
     'desired_column_1': [np.nan, 10, 25, 35, 35, 5, 10, 15, 15, 15, 10],
     'desired_column_2': [np.nan, np.nan, 15, 25, 25, 30, 5, 10, 10, 10, 20]}

df = pd.DataFrame(data=d)

Use this answer for case 1, then create an indicator_combined column for case 2 and apply the same approach: https://stackoverflow.com/questions/20648346/computing-diffs-within-groups-of-a-dataframe – Phung Duy Phong, Aug 15 '18 at 10:31
Also, I think using a for loop is not really ideal in Python: https://stackoverflow.com/questions/47755442/what-is-vectorization – Phung Duy Phong, Aug 15 '18 at 10:36

score 4 · Accepted Answer · answered Aug 15 '18 at 13:44

First ensure the 'Date' column has a datetime dtype, then add a column holding the difference in minutes from row to row:

df.Date = pd.to_datetime(df.Date)
# Minutes since the previous row, as floats (NaN on the first row)
df['minD'] = df.Date.diff().dt.total_seconds() / 60
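
As a small illustration of this step, here is a sketch on a made-up four-row Series (the names dates and min_d are just for the example). It uses diff() and total_seconds(), which should give float minutes regardless of pandas version.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime([
    "2013-04-01 03:50:00",
    "2013-04-01 04:00:00",
    "2013-04-01 04:15:00",
    "2013-04-01 04:15:00",
]))

# diff() subtracts the previous row; total_seconds() / 60 converts to float minutes.
min_d = dates.diff().dt.total_seconds() / 60
print(min_d.tolist())  # [nan, 10.0, 15.0, 0.0]
```

Rows that repeat a timestamp get a difference of 0, which matters for the forward-fill step later on.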

Next, create a grouping key for each condition. The key is shifted down one row because we want the time since the previous x, so the row containing the next x must still be measured against it. Without the shift, that row would start a new group instead of closing the previous one.

mask2 = (df.indicator_1.str.cat(df.indicator_2) == 'yu').cumsum().shift(1)
mask1 = (df.indicator_1 == 'x').cumsum().shift(1)
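
To see what the shift changes, here is a minimal sketch on a toy indicator column (the variable names are hypothetical):

```python
import pandas as pd

indicator_1 = pd.Series(["x", "y", "z", "x", "z", "y"])

is_x = indicator_1 == "x"
key_unshifted = is_x.cumsum()         # the second 'x' row already starts group 2
key_shifted = is_x.cumsum().shift(1)  # the second 'x' row still belongs to group 1

print(key_unshifted.tolist())  # [1, 1, 1, 2, 2, 2]
print(key_shifted.tolist())    # [nan, 1.0, 1.0, 1.0, 2.0, 2.0]
```

With the shifted key, the second 'x' (position 3) stays in group 1, so its elapsed time is accumulated from the first 'x'.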

Now group by the masks and cumsum() the minute differences. Keys that are not greater than 0 are filtered out first, since the condition has not yet occurred on those rows and the time difference should be missing.

df['desired_column_1'] = df.groupby(mask1.where(mask1 > 0)).minD.cumsum() 
df['desired_column_2'] = df.groupby(mask2.where(mask2 > 0)).minD.cumsum()
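
A small sketch of the grouped cumulative sum, using made-up values where the event occurs at positions 1 and 4 (the frame and the column names min_d and elapsed are assumptions for the example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"min_d": [np.nan, 10.0, 15.0, 10.0, 0.0, 5.0]})
key = pd.Series([np.nan, 0.0, 1.0, 1.0, 1.0, 2.0])  # hypothetical shifted cumsum key

# Keys of 0 mean the event has not occurred yet; where() turns them into NaN,
# and groupby leaves rows with a NaN key out of every group.
valid_key = key.where(key > 0)
df["elapsed"] = df.groupby(valid_key)["min_d"].cumsum()
print(df["elapsed"].tolist())
```

The first two rows stay missing because no event precedes them; the remaining rows accumulate minutes within their group.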

A row that shares its timestamp with the event row directly above it starts a new group with a value of 0. Replace those 0 values by forward filling the data:

df.desired_column_1 = df.desired_column_1.mask(df.desired_column_1 == 0).ffill()
df.desired_column_2 = df.desired_column_2.mask(df.desired_column_2 == 0).ffill()
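
The zero replacement can be sketched on a standalone Series with mask() followed by ffill(). The values below are made up, and the sketch assumes the only missing values sit at the start of the column, as in this example:

```python
import numpy as np
import pandas as pd

elapsed = pd.Series([np.nan, 10.0, 35.0, 0.0, 5.0, 15.0, 0.0, 10.0])

# A zero marks a row that shares its timestamp with the event row just above it;
# masking it and forward filling copies that event row's elapsed minutes down.
filled = elapsed.mask(elapsed == 0).ffill()
print(filled.tolist())  # [nan, 10.0, 35.0, 35.0, 5.0, 15.0, 15.0, 10.0]
```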

This yields

                  Date indicator_1 indicator_2  desired_column_1  desired_column_2
0  2013-04-01 03:50:00           x           w               NaN               NaN
1  2013-04-01 04:00:00           y           u              10.0               NaN
2  2013-04-01 04:15:00           z           v              25.0              15.0
3  2013-04-01 04:25:00           x           w              35.0              25.0
4  2013-04-01 04:25:00           z           u              35.0              25.0
5  2013-04-01 04:30:00           y           u               5.0              30.0
6  2013-04-01 04:35:00           y           w              10.0               5.0
7  2013-04-01 04:40:00           z           w              15.0              10.0
8  2013-04-01 04:40:00           x           u              15.0              10.0
9  2013-04-01 04:40:00           y           v              15.0              10.0
10 2013-04-01 04:50:00           x           w              10.0              20.0
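
As a cross-check, here is a sketch of a different technique (not the cumsum approach above): look up the latest event with a strictly earlier timestamp using np.searchsorted. The helper name minutes_since_previous is hypothetical, and the sketch assumes Date is sorted in ascending order.

```python
import numpy as np
import pandas as pd


def minutes_since_previous(dates, is_event):
    """Minutes since the latest event stamped strictly before each row.

    Assumes `dates` is a datetime64 Series sorted in ascending order.
    """
    event_times = dates[is_event].drop_duplicates().to_numpy()
    times = dates.to_numpy()
    # side="left" counts the events strictly earlier than each timestamp.
    n_earlier = np.searchsorted(event_times, times, side="left")
    out = np.full(len(dates), np.nan)
    has_prev = n_earlier > 0
    delta = times[has_prev] - event_times[n_earlier[has_prev] - 1]
    out[has_prev] = delta / np.timedelta64(1, "m")
    return pd.Series(out, index=dates.index)


df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2013-04-01 03:50:00", "2013-04-01 04:00:00", "2013-04-01 04:15:00",
        "2013-04-01 04:25:00", "2013-04-01 04:25:00", "2013-04-01 04:30:00",
        "2013-04-01 04:35:00", "2013-04-01 04:40:00", "2013-04-01 04:40:00",
        "2013-04-01 04:40:00", "2013-04-01 04:50:00",
    ]),
    "indicator_1": list("xyzxzyyzxyx"),
    "indicator_2": list("wuvwuuwwuvw"),
})

col_1 = minutes_since_previous(df["Date"], df["indicator_1"] == "x")
col_2 = minutes_since_previous(
    df["Date"], (df["indicator_1"] == "y") & (df["indicator_2"] == "u")
)
print(pd.DataFrame({"col_1": col_1, "col_2": col_2}))
```

On the sample data this reproduces the desired columns from the question, and it does not depend on the order of rows that share a timestamp.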

score 0 · Answer 2 · answered Aug 15 '18 at 10:35

import numpy as np
import pandas as pd

df = df.loc[:, ['Date', 'indicator_1', 'indicator_2']]
idx = df.index
df['Date'] = pd.to_datetime(df['Date'])

# Sort by indicator_1 (a stable sort keeps the Date order within each value) before using diff()
df = df.sort_values(['indicator_1'], kind='stable')
df['diffs1'] = df['Date'].diff()
# Compare with the previous row to find where the value changes (e.g. from x to y); no diff applies there
mask = df.indicator_1 != df.indicator_1.shift(1)
df.loc[mask, 'diffs1'] = np.nan
df = df.reindex(idx)  # restore the original row order

# Same for case 2
df['indicator_3'] = df['indicator_1'] + df['indicator_2']
df = df.sort_values(['indicator_3'], kind='stable')
df['diffs2'] = df['Date'].diff()
mask = df.indicator_3 != df.indicator_3.shift(1)
df.loc[mask, 'diffs2'] = np.nan
df = df.reindex(idx)


Phung Duy Phong


Thank you for your answer! I tried your code for the first case, but I am not getting the values I want in the desired column; they are quite different. The goal is to count the number of minutes since the last occurrence of 'x'. – Miguel Lambelho, Aug 15 '18 at 10:55
Oh sorry, I did not quite understand your situation. – Phung Duy Phong, Aug 15 '18 at 11:06
