0

I have a seven column data frame that contains many thousands of rows. Below is a sample. The seventh column is populated dynamically based on the values in the other six columns in the current row and the row above it.

The nested np.where statement I came up with works but it's not very flexible or easy to read. I explored using np.select but I don't have a finite set of values ex-ante that I can assign. Technically, they can be any real number.

So I'm wondering if there's a better way to perform this type of conditional vectorized operation given the constraint?

df['G'] = np.where(df['D'].eq(0) & df['D'].shift(1).eq(0) | 
             df['D'].ne(0) & df['D'].shift(1).ne(0), df['C'], 
             np.where(df['D'].eq(0) & df['D'].shift(1).eq(-1) & 
                      (df['B'] <= df['E'].shift(1)) & (df['E'].shift(1) <= df['A']), df['E'].shift(1), 
             np.where(df['D'].eq(0) & df['D'].shift(1).eq(-1) & 
                      (df['A'] >= df['F'].shift(1)) & (df['F'].shift(1) >= df['B']), df['F'].shift(1),
             np.where(df['D'].eq(0) & df['D'].shift(1).eq(1) & 
                      (df['B'] <= df['E'].shift(1)) & (df['E'].shift(1) <= df['A']), df['E'].shift(1),
             np.where((df['A'] >= df['F'].shift(1)) & (df['F'].shift(1) >= df['B']), 
                               df['F'].shift(1), df['C'])))))

df

                 A       B       C    D       E       F       G
Date                                                           
2021-05-17  126.93  125.17  125.90  0.0    0.00    0.00  125.90
2021-05-18  126.99  124.78  124.48  0.0    0.00    0.00  124.48
2021-05-19  124.92  122.86  124.32  1.0  122.86  135.15  124.32
2021-05-20  127.72  125.10  126.93  1.0  122.86  135.15  126.93
2021-05-21  128.00  125.21  125.06 -1.0  128.00  115.20  125.06
2021-05-24  127.94  125.94  126.72 -1.0  128.00  115.20  126.72
2021-05-25  128.32  126.32  126.53  0.0    0.00    0.00  128.00
2021-05-26  127.39  126.42  126.48  0.0    0.00    0.00  126.48
2021-05-27  127.64  125.08  124.91  0.0    0.00    0.00  124.91
2021-05-28  125.80  124.55  124.24 -1.0  125.80  113.22  124.24
2021-06-01  125.35  123.94  123.91 -1.0  125.80  113.22  123.91
2021-06-02  125.24  124.05  124.69  1.0  124.05  136.46  124.69
2021-06-03  124.85  123.13  123.18  0.0    0.00    0.00  124.05
2021-06-04  126.16  123.85  125.52  1.0  123.85  136.23  125.52
2021-06-07  126.32  124.83  125.53  1.0  123.85  136.23  125.53
2021-06-08  128.46  126.21  126.37  1.0  123.85  136.23  126.37
jgg
  • 791
  • 4
  • 17

1 Answers1

3

Just stripping the multiple np.where conditions and combining them in a single np.select produces the same outcome.

df['G'] = np.select([df['D'].eq(0) & df['D'].shift(1).eq(0) | df['D'].ne(0) & df['D'].shift(1).ne(0),
                      df['D'].eq(0) & df['D'].shift(1).eq(-1) & (df['B'] <= df['E'].shift(1)) & (df['E'].shift(1) <= df['A']),
                      df['D'].eq(0) & df['D'].shift(1).eq(-1) & (df['A'] >= df['F'].shift(1)) & (df['F'].shift(1) >= df['B']),
                      df['D'].eq(0) & df['D'].shift(1).eq(1) & (df['B'] <= df['E'].shift(1)) & (df['E'].shift(1) <= df['A']), 
                      (df['A'] >= df['F'].shift(1)) & (df['F'].shift(1) >= df['B'])], 
                     [df['C'], df['E'].shift(1), df['F'].shift(1), df['E'].shift(1), df['F'].shift(1)], df['C'])

In fact, some of your conditions are redundant. For example, the condition of the third np.where is:

df['D'].eq(0) & df['D'].shift(1).eq(-1) & (df['A'] >= df['F'].shift(1)) & (df['F'].shift(1) >= df['B'])

and the condition of the fifth np.where is

(df['A'] >= df['F'].shift(1)) & (df['F'].shift(1) >= df['B'])

and both assign df['F'].shift(1) when True. Since the former implies the latter and both select the same value, the former is redundant.

There are other cases where two separate conditions select the same value (those can be combined etc).

Then we can use abs and between methods to further simplify the conditions, e.g.

df['D'].shift(1).eq(-1) | df['D'].shift(1).eq(1) 

can be written as

df['D'].shift(1).abs().eq(1)

and

(df['A'] >= df['F'].shift(1)) & (df['F'].shift(1) >= df['B'])

can be written as

 df['F'].shift(1).between(df['B'], df['A'])

So the above np.select can be further simplified as:

df['G_np_select'] = np.select([df['D'].shift().abs().eq(1) & df['D'].eq(0) & 
                               df['E'].shift().between(df['B'], df['A']),
                               df['F'].shift().between(df['B'], df['A'])
                              ], 
                              [df['E'].shift(), df['F'].shift()], 
                              df['C'])

Output:

                 A       B       C    D       E       F       G  G_np_select
Date   
2021-05-17  126.93  125.17  125.90  0.0    0.00    0.00  125.90       125.90
2021-05-18  126.99  124.78  124.48  0.0    0.00    0.00  124.48       124.48
2021-05-19  124.92  122.86  124.32  1.0  122.86  135.15  124.32       124.32
2021-05-20  127.72  125.10  126.93  1.0  122.86  135.15  126.93       126.93
2021-05-21  128.00  125.21  125.06 -1.0  128.00  115.20  125.06       125.06
2021-05-24  127.94  125.94  126.72 -1.0  128.00  115.20  126.72       126.72
2021-05-25  128.32  126.32  126.53  0.0    0.00    0.00  128.00       128.00
2021-05-26  127.39  126.42  126.48  0.0    0.00    0.00  126.48       126.48
2021-05-27  127.64  125.08  124.91  0.0    0.00    0.00  124.91       124.91
2021-05-28  125.80  124.55  124.24 -1.0  125.80  113.22  124.24       124.24
2021-06-01  125.35  123.94  123.91 -1.0  125.80  113.22  123.91       123.91
2021-06-02  125.24  124.05  124.69  1.0  124.05  136.46  124.69       124.69
2021-06-03  124.85  123.13  123.18  0.0    0.00    0.00  124.05       124.05
2021-06-04  126.16  123.85  125.52  1.0  123.85  136.23  125.52       125.52
2021-06-07  126.32  124.83  125.53  1.0  123.85  136.23  125.53       125.53
2021-06-08  128.46  126.21  126.37  1.0  123.85  136.23  126.37       126.37