
I have a big DataFrame with many columns. For simplicity, let's say:

df_sample = pd.DataFrame({'a':np.arange(10)})

I need to define a new column in df_sample (say column 'b') using an interpolation function whose argument is taken from column 'a'.

Now, the problem is that the interpolation function is different for each row: each row interpolates from a different 1D grid, so each row has its own interpolation function. What I did was generate these interpolation functions beforehand and store them in an array. As an example, the code below generates a sample array 'list_interpfns':

list_interpfns = np.array([None]*10)
for j in range(10):
    list_interpfns[j] = scipy.interpolate.interp1d(np.linspace(0,10*(j+1),10),np.linspace(0,50,10))

To generate df_sample.b[j], I need to call list_interpfns[j] with the argument df_sample.a[j]. Since I cannot directly apply a column formula for this, I put it inside a loop.

df_sample['b'] = 0
for j in range(10):
    df_sample.loc[j,'b'] = list_interpfns[j](df_sample.a[j])

The problem is that this operation takes a lot of time. In this small example the computation might seem fast, but my actual program is much larger, and when I compared the time taken by all operations, this particular sequence took 84% of the total time; I need to speed it up.

If there is some way to avoid the for loop (e.g. using df.apply), I believe it could reduce the operation time. Could you suggest possible alternatives?

joseph praful

1 Answer


Consider avoiding the multiple for loops and the bookkeeping of initializing and updating arrays and Series, and instead pass the column values into both the function builder and the function call using Series.apply():

def interp_(j):
    return scipy.interpolate.interp1d(np.linspace(0,10*(j+1),10), np.linspace(0,50,10))

df_sample['b_'] = df_sample['a'].apply(lambda x: interp_(x)(x))

The results replicate your original (note that this works here because column 'a' happens to equal the row index j, so interp_(x) builds the same interpolator as list_interpfns[j]):

df_sample
#    a         b        b_
# 0  0  0.000000  0.000000
# 1  1  2.500000  2.500000
# 2  2  3.333333  3.333333
# 3  3  3.750000  3.750000
# 4  4  4.000000  4.000000
# 5  5  4.166667  4.166667
# 6  6  4.285714  4.285714
# 7  7  4.375000  4.375000
# 8  8  4.444444  4.444444
# 9  9  4.500000  4.500000

And the timings suggest faster processing, even though Series.apply() is still a loop under the hood:

def run1():
    list_interpfns = np.array([None]*10)
    for j in range(10):
        list_interpfns[j] = scipy.interpolate.interp1d(np.linspace(0,10*(j+1),10),
                                                       np.linspace(0,50,10))            
    df_sample['b'] = 0
    for j in range(10):
        df_sample.loc[j,'b'] = list_interpfns[j](df_sample.a[j])

def run2():
    def interp_(j):
        return scipy.interpolate.interp1d(np.linspace(0,10*(j+1),10), np.linspace(0,50,10))

    df_sample['b_'] = df_sample['a'].apply(lambda x: interp_(x)(x))

if __name__=='__main__':
    from timeit import Timer

    f1 = Timer("run1()", "from __main__ import run1")
    res1 = f1.repeat(repeat=100, number=1)
    print('LOOP: {}'.format(np.mean(res1)))

    f2 = Timer("run2()", "from __main__ import run2")
    res2 = f2.repeat(repeat=100, number=1)
    print('APPLY: {}'.format(np.mean(res2)))

# LOOP: 0.006322918700000002
# APPLY: 0.0015046094699999867
Parfait
  • Thanks for the reply. Unfortunately, my actual interpolation function is much more complicated, and I cannot easily define it within a separate interp_ function as you have done. I cannot avoid the for loop for creating the array list_interpfns. Would it be possible to use df.apply() with two variables (in this case, df_sample['a'] and df_sample['list_interpfns'])? – joseph praful Apr 25 '19 at 08:03
  • Ah yes, I found how to do df.apply() on multiple columns. Now my code is faster by a factor of almost 7 :) https://stackoverflow.com/questions/13331698/how-to-apply-a-function-to-two-columns-of-pandas-dataframe/13337376 – joseph praful Apr 25 '19 at 09:28
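For reference, a minimal sketch of that two-column approach from the linked question, under the assumption that the interpolators are kept in a (hypothetical) column interp_fn and applied row-wise with df.apply(axis=1):

# Sketch only: store each row's interpolator in a column, then apply across both columns.
df_sample['interp_fn'] = list_interpfns  # hypothetical column holding one interp1d object per row
df_sample['b'] = df_sample.apply(lambda row: float(row['interp_fn'](row['a'])), axis=1)

Note that axis=1 still iterates row by row in Python, so the speed-up over the original loop likely comes mostly from avoiding the per-row .loc assignments rather than from true vectorization.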