
I have a dataframe df of the following shape:

         gvkey   fyear      x1     x2  ...        x43        x44       x45       x46
0         1000  1973.0   3.277  0.297  ...        NaN        NaN       NaN       NaN
1         1000  1974.0   3.494  0.010  ...        NaN        NaN       NaN       NaN
2         1000  1975.0   5.335  0.180  ...  27.686086  28.185823 -0.606521  0.030211
3         1000  1976.0   7.143  0.348  ...  25.982902  28.552185 -0.995969  0.024171
4         1000  1977.0   3.503  0.234  ...  16.120394  14.017921  2.231234  0.010640
       ...     ...     ...    ...  ...        ...        ...       ...       ...
140008  184740  1983.0  12.728  0.000  ...  37.367403  35.477801 -2.472318  0.009437
140009  184740  1984.0   9.219  0.000  ...  29.819767  27.435230  3.948097  0.004553
140010  187839  1986.0  -0.016  0.000  ...        NaN        NaN       NaN       NaN
140011  187839  1987.0  -0.416  0.000  ...        NaN        NaN       NaN       NaN
140012  187839  1988.0  -0.925  0.000  ...  -0.578135  -0.647306  0.033489  0.445458

I am trying to speed up a function that I have to apply row by row; it cannot be vectorized. Profiling it with line_profiler, I noticed that the most expensive lines are the np.where() calls (np being numpy) that I am using. They all have the following form:

def function(row_of_df):
    # ... some computations on this specific row produce a partial output, Z ...
    key = row_of_df['gvkey']
    year = row_of_df['fyear']
    # Write Z into x45 for the row matching (key, year); every other row keeps its value.
    df['x45'] = np.where((df['gvkey'] == key) & (df['fyear'] == year), Z, df['x45'])

In other words, I perform computations on a given row to obtain Z; then, in that row, the value of df['x45'] becomes Z, while every other row keeps its current value. I cannot avoid this step, because I need to apply the function row by row and update df['x45'] iteratively, following the order of the rows.
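For illustration, here is a minimal sketch of the pattern described above next to a variant that writes a single cell by position instead of rebuilding the whole column on every iteration. It assumes each (gvkey, fyear) pair identifies exactly one row and that rows are processed in their stored order. `compute_z`, `update_with_where`, and `update_by_position` are hypothetical names, and `compute_z` is only a stand-in for the real per-row computation, which is not shown in the question.

```python
import numpy as np
import pandas as pd

def compute_z(x1, prev_x45):
    # Hypothetical stand-in for the real per-row computation.
    # It uses the previously updated x45 value, so row order matters.
    return x1 + (0.0 if np.isnan(prev_x45) else 0.5 * prev_x45)

def update_with_where(df):
    # Pattern from the question: every iteration compares two full columns
    # and rebuilds all of x45 just to change one cell.
    out = df.copy()
    for pos in range(len(out)):
        key = out["gvkey"].iat[pos]
        year = out["fyear"].iat[pos]
        prev = out["x45"].iat[pos - 1] if pos > 0 else np.nan
        z = compute_z(out["x1"].iat[pos], prev)
        out["x45"] = np.where(
            (out["gvkey"] == key) & (out["fyear"] == year), z, out["x45"]
        )
    return out

def update_by_position(df):
    # Same result under the uniqueness assumption: the loop already knows
    # which row it is on, so it writes one element of a NumPy array.
    out = df.copy()
    x1 = out["x1"].to_numpy()
    x45 = out["x45"].to_numpy(dtype=float, copy=True)
    for pos in range(len(out)):
        prev = x45[pos - 1] if pos > 0 else np.nan
        x45[pos] = compute_z(x1[pos], prev)
    out["x45"] = x45  # write the column back once, after the loop
    return out
```

The second version still loops row by row and preserves the update order; it only replaces the per-iteration full-column scan and rewrite with a scalar assignment.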

My question is: is there a faster alternative to np.where() in this case? Any suggestions are welcome.

user9875321__
  • Are you sure the most expensive computation is inside the np.where? – Dani Mesejo Dec 03 '20 at 17:07
  • Usually `np.where` is quite fast. – Mayank Porwal Dec 03 '20 at 17:08
  • Does this help: https://stackoverflow.com/questions/33281957/faster-alternative-to-numpy-where? – Pygirl Dec 03 '20 at 17:09
  • What is this line `df['x45']=np.where((df['gvkey']==key)&(df['fyear']==year), Z, df['x45'])` supposed to do? For each row you rewrite the entire df['x45'] column? – Dani Mesejo Dec 03 '20 at 17:10
  • As commented, `np.where` is pretty fast. I bet this is the problem: ` key=row_of_df['gvkey']; year=row_of_df['fyear']` I bet you are looping over some dataframe, which is generally slow. And on top of that, you try to modify the data on every single iteration. – Quang Hoang Dec 03 '20 at 17:13
  • @QuangHoang I know, but I have no other options. The functions I am working on are really complex to implement and the best way I have to proceed is indeed this one. Otherwise I simply cannot achieve what I need – user9875321__ Dec 03 '20 at 17:21
  • @Matteo I seriously doubt that. There are certainly alternatives, e.g. `merge` or `groupby` to map/replace values without looping over the unique values. – Quang Hoang Dec 03 '20 at 17:23
  • @DaniMesejo well, I would rather skip the part where I rewrite the column, but `numpy.where` asks me to specify a second "to-do" task if I use it to make changes to the dataframe. Would it be better to use something like `np.where((df['gvkey']==key)&(df['fyear']==year))` alone to get the indices I need, and only then change the values at those indices? – user9875321__ Dec 03 '20 at 17:24
  • Last comment: what you are asking seems to be an [XY problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). – Quang Hoang Dec 03 '20 at 17:28
  • @QuangHoang Unfortunately neither `merge` nor `groupby` can help in this case. This has nothing to do with the XY problem. I am simply asking if there is a faster alternative to `np.where()` for that task. That is my problem. I am seeking a solution, that's why I asked here – user9875321__ Dec 03 '20 at 17:41
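The idea raised in the comments, locating the matching row first and then changing only that cell, can be sketched as follows. This is an illustration under assumptions, not a benchmarked answer: the small frame, the values of `key`, `year`, and `Z`, and the `position` lookup are all made up for the example, and the lookup assumes each (gvkey, fyear) pair is unique.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the values are made up for the example.
df = pd.DataFrame({
    "gvkey": [1000, 1000, 1000, 187839],
    "fyear": [1973.0, 1974.0, 1975.0, 1986.0],
    "x45": [np.nan, np.nan, -0.606521, np.nan],
})

key, year, Z = 1000, 1974.0, 1.5  # illustrative values

# Option 1: one-argument np.where returns a tuple of index arrays.
# The mask is still built by scanning both columns, but only the
# matching cell is written instead of the whole column.
mask = ((df["gvkey"] == key) & (df["fyear"] == year)).to_numpy()
(idx,) = np.where(mask)
df.iloc[idx, df.columns.get_loc("x45")] = Z

# Option 2: build a (gvkey, fyear) -> row position lookup once, then
# each update is a dictionary lookup plus a scalar write, with no scan.
position = {pair: i for i, pair in enumerate(zip(df["gvkey"], df["fyear"]))}
x45 = df["x45"].to_numpy(copy=True)
x45[position[(1000, 1975.0)]] = 2.5
df["x45"] = x45  # write back once after all updates
```

Option 1 narrows only the write; the comparison over the full columns remains. Option 2 removes the per-update scan as well, at the cost of building the lookup once up front.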

0 Answers