I have a large DataFrame with many columns. For simplicity, let's say:
import numpy as np
import pandas as pd
import scipy.interpolate
df_sample = pd.DataFrame({'a': np.arange(10)})
I need to add a new column to df_sample (say, column 'b') whose values are computed by an interpolation function that takes its argument from column 'a'.
The problem is that the interpolation function is different for each row: each row interpolates from a different 1D grid. So I generate these interpolation functions beforehand and store them in an array. As an example, the code below generates a sample array, 'list_interpfns':
list_interpfns = np.array([None]*10)
for j in range(10):
    list_interpfns[j] = scipy.interpolate.interp1d(np.linspace(0, 10*(j+1), 10), np.linspace(0, 50, 10))
To compute df_sample.b[j], I need to call list_interpfns[j] with the argument df_sample.a[j]. Since I can't express this as a single column-wise formula, I put it inside a loop:
df_sample['b'] = 0.0  # float column, since the interpolated values are floats
for j in range(10):
    df_sample.loc[j, 'b'] = list_interpfns[j](df_sample.a[j])
The problem is that this operation takes a lot of time. In this small example the computation may seem fast, but my actual program is much larger. When I compared the time taken by all operations, this particular sequence took 84% of the total time, so I need to speed it up.
If there is some way to avoid the for loop (using df.apply or something similar), I believe it could reduce the run time. Could you suggest possible alternatives?
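One possible direction, given only as an unmeasured sketch: the loop above writes to the DataFrame one cell at a time through .loc, and that per-row assignment may account for much of the cost. The version below still evaluates one interpolator per row, but it pulls column 'a' out as a NumPy array and assigns column 'b' in a single step. The names n and a_values are introduced here for illustration, and the sample interpolators are the same as in the question.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

n = 10  # illustrative row count, matching the sample above
df_sample = pd.DataFrame({'a': np.arange(n)})

# Same sample interpolators as in the question: one per row.
list_interpfns = [
    interp1d(np.linspace(0, 10 * (j + 1), 10), np.linspace(0, 50, 10))
    for j in range(n)
]

# Evaluate each interpolator on its own row's value, then assign the
# whole column once instead of writing row by row with .loc.
a_values = df_sample['a'].to_numpy()
df_sample['b'] = [float(f(x)) for f, x in zip(list_interpfns, a_values)]
```

This still runs a Python-level loop over the interpolators, so it only removes the DataFrame indexing overhead. Whether that is enough depends on the real data, so it would need to be timed against the original loop.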