How to avoid using for loops to create a conditional new column for a huge pandas dataframe?

Question

I have the following pandas df and I want to create another column called df['trade_percentage'] where by for every row, if df['trade'] == "ask", df['trade_percentage'] = df['trade_size']/df['ask_size'] else if df['trade'] == "bid", df['trade_percentage'] = df['trade_size']/df['bid_size']. Ignore NaN. I have written these codes:

However, I have 1 million rows in my dataframe and this has been running for a few hours and is still running. May I ask if there is any way for me to simplify it instead of using loops? Thank you!

for i in range(len(df)):
  if (df['trade'][i] == "ask"):
    df['trade_percentage'][i] = df['trade_size'][i]/df['ask_size'][i]
  elif (df['trade'][i] == "bid"):
    df['trade_percentage'][i] = df['trade_size'][i]/df['bid_size'][i]  
df.head()

             time            bid_size   bid       ask  ask_size trade trade_size phase  
0   2019-01-07 07:45:01.064515  495   152.52    152.54    19     NaN      NaN    OPEN   
1   2019-01-07 07:45:01.110072  31    152.53    152.54    19     NaN      NaN    OPEN   
2   2019-01-07 07:45:01.116596  32    152.53    152.54    19     NaN      NaN    OPEN   
3   2019-01-07 07:45:01.116860  32    152.53    152.54    21     NaN      NaN    OPEN   
4   2019-01-07 07:45:01.116905  34    152.53    152.54    21     NaN      NaN    OPEN   
5   2019-01-07 07:45:01.116982  34    152.53    152.54    31     NaN      NaN    OPEN   
6   2019-01-07 07:45:01.147901  38    152.53    152.54    31     NaN      NaN    OPEN   
7   2019-01-07 07:45:01.189971  38    152.53    152.54    31     ask     15.0    OPEN   
8   2019-01-07 07:45:01.189971  38    152.53    152.54    16     NaN      NaN    OPEN   
9   2019-01-07 07:45:01.190766  37    152.53    152.54    16     NaN      NaN    OPEN   
10  2019-01-07 07:45:01.190856  37    152.53    152.54    15     NaN      NaN    OPEN
11  2019-01-07 07:45:01.190856  37    152.53    152.54    16     ask      1.0    OPEN   
12  2019-01-07 07:45:01.193938  37    152.53    152.55   108     NaN      NaN    OPEN   
13  2019-01-07 07:45:01.193938  37    152.53    152.54    15     ask     15.0    OPEN   
14  2019-01-07 07:45:01.194326  2     152.54    152.55   108     NaN      NaN    OPEN   
15  2019-01-07 07:45:01.194453  2     152.54    152.55    97     NaN      NaN    OPEN   
16  2019-01-07 07:45:01.194479  6     152.54    152.55    97     NaN      NaN    OPEN   
17  2019-01-07 07:45:01.194507  19    152.54    152.55    97     NaN      NaN    OPEN   
18  2019-01-07 07:45:01.194532  19    152.54    152.55    77     NaN      NaN    OPEN   
19  2019-01-07 07:45:01.194598  19    152.54    152.55    79     NaN      NaN    OPEN

BENY · Answer 1 · 2019-04-19T02:23:01.250

1

I am using np.select

s1=df['trade'] == "ask"
s2=df['trade'] == "bid"
np.select([s1,s2],[df['trade_size']/df['ask_size'],df['trade_size']/df['bid_size']],default =np.nan)

edited Apr 19 '19 at 02:23

answered Apr 19 '19 at 02:20

BENY

317,841
20
164
234

will they be merge together into 1 dataframe or will I have 2 separate df where one is for bid and one is for ask? – atjw94 Apr 19 '19 at 02:21
@atjw94 it will return into one array , check the example https://docs.scipy.org/doc/numpy/reference/generated/numpy.select.html – BENY Apr 19 '19 at 02:23

How to avoid using for loops to create a conditional new column for a huge pandas dataframe?

1 Answers1