1

I am new to python but have been in the programming world for a while. I have already tried to do the following with if else loop using python dataframes and iloc and was successful. I want to use vectorization. The problem is that once a data changes its state based upon rule 1 (b < a) I want it to remain in that state as long as rule 2 (c < a) applies.

So the input and output will look like the following:

TABLE

I have tried numpy select and numpy where with shift but the compare happens upon the same column so I couldn't get it to work. It works the first time when the rule 2 applies not after that. since above is the sample problem I created from the one I have so that code is hard to provide

The Flag is what I expect but i don't get it

df is with loops and df1 is with numpy - I can't get it to look same

import pandas as pd
import numpy as np


df = pd.DataFrame({"a": [20,25,30,32,25,26,30,24,25,30,30,35,20,22,25,28,29,31,25,21], "b": [26,28,26,35,34,34,31,26,28,29,31,32,18,24,30,35,32,35,36,30], "c":[18,19,20,22,23,24,34,22,23,31,32,38,22,21,22,21,18,22,28,31]})

df1 = df

position=''
Flag = False
df =  df.assign(flag=Flag) 

for id1, id2 in zip(df.iterrows(), df.iloc[1:].iterrows()):

   if id1[1]['b'] < id1[1]['a'] :
        Flag=True
        position='rule1True' 
   elif ( id1[1]['c'] <  id1[1]['a'] and (position =='rule1True' or position == 'rule2True') ):
        Flag = True
        position = 'rule2True'
   else :
        Flag = False
        position = ''
   df.at[id1[0],'flag'] = Flag


print(df)

df1['rule1'] = np.select([df1['b'] < df1['a']],[True], default= False)
df1['rule2'] = np.select([( df1['rule1'].shift(1) & (df1['c'] < df1['a']))],[True], default= False)
df1['flag'] = np.select([( df1['rule1'] | df1['rule2'])],[True], default= False)

print(df1)
  • Can you share the code you had working with loops? – danielR9 Apr 12 '19 at 23:15
  • That code is pretty simple in terms i set a flag once rule 1 is satisfied, and keep looping through the data. if in the next step rule 1 flag is set and rule 2 is satisfied i have the flag true. when the rule 2 is not satisfied i reset the rule 1 flag too. This is a for loop with comparison of the current data with previous data. I am not able to do this with numpy where or select. It seems we can't do recurrence as per this post https://stackoverflow.com/questions/4407984/is-it-possible-to-vectorize-recursive-calculation-of-a-numpy-array-where-each-el – getting_there Apr 12 '19 at 23:30
  • If you actually share a minimal example of your data and existing code, we can easily copy/paste and make an answer for you. See https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – danielR9 Apr 12 '19 at 23:37
  • Thanks for helping, i wrote some sample code for this and updated the post – getting_there Apr 13 '19 at 00:37
  • I added the code with the loops and with numpy – getting_there Apr 13 '19 at 00:48

1 Answers1

0

My for loop solution was very slow. Here it is a proper vector solution, this works. Very fast.

mask1 = df['b'] < df['a']
mask2 = df['c'] < df['a']
mask3 = (mask1 == False) & (mask2 == False)

This True/False flag changes when mask1 mask3 alternate.

df.loc[mask1,'flag'] = True

df.loc[mask3,'flag'] = False

Use forward fill to fill in the NaNs.

df['flag'].ffill(inplace=True)

The first few rows might be Nan, they will be false. Fill them in with False

df['flag'] = df['flag'].fillna(False)

Tested on a million rows:

CPU times: user 917 ms, sys: 7.99 ms, total: 925 ms

Sample output:

     a   b   c   flag
0   22  21  37   True
1   21  31  30  False
2   25  23  32   True
3   34  36  22   True
4   31  36  30   True
5   20  32  36  False
6   25  24  24   True
7   24  20  29   True
8   36  36  22   True
9   36  24  25   True
10  22  24  20   True
11  22  24  27  False
12  31  37  26  False
13  37  24  22   True
14  28  22  26   True
15  27  27  32  False
16  26  32  36  False
17  32  37  30  False
18  28  37  36  False
19  37  22  24   True

Full code:

df = pd.DataFrame(np.random.randint(20, 38, [1000000,3]), columns=['a', 'b', 'c'])

mask1 = df['b'] < df['a']
mask2 = df['c'] < df['a']
mask3 = (mask1 == False) & (mask2 == False)

df.loc[mask1,'flag'] = True

df.loc[mask3,'flag'] = False

df['flag'].ffill(inplace=True)

df['flag'] = df['flag'].fillna(False)
run-out
  • 3,114
  • 1
  • 9
  • 25
  • 1
    i think this might work, i was looking at mask haven't looked at it before was trying with numpy. Will let u know thanks for looking at it – getting_there Apr 14 '19 at 02:47
  • 1
    Thanks @run-out this worked. It took me a while to test since the actual problem had about 6 scenarios when the graph changed positions. Thanks – getting_there Apr 14 '19 at 13:18
  • My pleasure, it was an interesting problem for me because I had the same situation'ish, which i previously showed you my solution. This is much better. Too bad my previous code is done now..... – run-out Apr 14 '19 at 13:22