I have a dataframe `df` of the following shape:
```
        gvkey   fyear      x1     x2  ...        x43        x44       x45       x46
0        1000  1973.0   3.277  0.297  ...        NaN        NaN       NaN       NaN
1        1000  1974.0   3.494  0.010  ...        NaN        NaN       NaN       NaN
2        1000  1975.0   5.335  0.180  ...  27.686086  28.185823 -0.606521  0.030211
3        1000  1976.0   7.143  0.348  ...  25.982902  28.552185 -0.995969  0.024171
4        1000  1977.0   3.503  0.234  ...  16.120394  14.017921  2.231234  0.010640
...       ...     ...     ...    ...  ...        ...        ...       ...       ...
140008 184740  1983.0  12.728  0.000  ...  37.367403  35.477801 -2.472318  0.009437
140009 184740  1984.0   9.219  0.000  ...  29.819767  27.435230  3.948097  0.004553
140010 187839  1986.0  -0.016  0.000  ...        NaN        NaN       NaN       NaN
140011 187839  1987.0  -0.416  0.000  ...        NaN        NaN       NaN       NaN
140012 187839  1988.0  -0.925  0.000  ...  -0.578135  -0.647306  0.033489  0.445458
```
I am trying to speed up a function that I must apply row by row; there is no way around that, and it cannot be vectorized. Profiling it with `line_profiler`, I noticed that the most expensive operations are some `np.where()` calls (`np` being NumPy). They all have the following form:
```python
def function(row_of_df):
    ...
    # ... say I have a partial output, Z, after some computations on a specific row
    key = row_of_df['gvkey']
    year = row_of_df['fyear']
    df['x45'] = np.where((df['gvkey'] == key) & (df['fyear'] == year), Z, df['x45'])
```
which means I basically perform computations on a certain row to obtain `Z`, and then, in that row, the value of `df['x45']` becomes `Z`, while everywhere else it stays as it is. The procedure cannot avoid doing this, as I need to apply the function row by row and update the value of `df['x45']` iteratively, following the order of the rows. There is no other option.
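For context, here is a minimal, self-contained sketch of the pattern described above on a toy frame (the column names are from my data, but the values and the helper name `update_x45` are made up for illustration, and `Z` is passed in rather than computed):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df with only the columns the update touches.
df = pd.DataFrame({
    "gvkey": [1000, 1000, 187839],
    "fyear": [1973.0, 1974.0, 1986.0],
    "x45":   [np.nan, np.nan, np.nan],
})

def update_x45(row_of_df, Z):
    """Write Z into df['x45'] for this row's (gvkey, fyear) pair,
    leaving every other row of the column unchanged."""
    key = row_of_df["gvkey"]
    year = row_of_df["fyear"]
    df["x45"] = np.where((df["gvkey"] == key) & (df["fyear"] == year), Z, df["x45"])

update_x45(df.iloc[1], 3.14)
# Only the (1000, 1974.0) row gets 3.14; the other rows keep NaN.
```

Note that even though only one cell changes, `np.where()` builds the boolean mask and rewrites the entire `x45` column on every call, which is why it dominates the profile.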
My question is: is there a faster alternative to `np.where()` in this case? Any suggestion is welcome.