I have a fairly large dataframe on which I want to:

  1. search each row for the interval that contains a given value
  2. linearly interpolate between the two elements found in step 1 and the two corresponding elements of another array
  3. add a column to the dataframe with the interpolated values

My current approach uses a for loop.

Given a sample of the dataframe Fak:

      beta0     beta1     beta2     beta3     beta4     beta5     beta6     beta7     beta8     beta9  beta10
0  0.008665  0.061391  0.159690  0.223275  0.232535  0.251266  0.279847  0.465671  0.672253  0.914753     1.0
1  0.009121  0.064322  0.166623  0.232418  0.241945  0.261106  0.290169  0.477621  0.682283  0.916384     1.0
2  0.009491  0.066689  0.172210  0.239776  0.249516  0.269020  0.298463  0.487108  0.690031  0.917638     1.0
3  0.009733  0.068232  0.175837  0.244542  0.254418  0.274140  0.303820  0.493102  0.694703  0.918304     1.0
4  0.009860  0.069027  0.177687  0.246963  0.256906  0.276734  0.306523  0.495985  0.696696  0.918511     1.0

I have an array psi

[-12.97, -11.97, -10.97, -9.97, -8.97, -7.97, -6.97, -5.97, -4.97, -3.97, -2.97, -1.97]

I define the value I want to search for in Fak as a percentage, e.g. intF = 16, so that the value actually searched for is intF/100 = 0.16.

I compute the new column with the following loop:

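import numpy as np

target = int(intF)/100  # value searched for in each row
dxlist = []
for i, Faki in Fak.iterrows():
    # position of the upper interpolation bound
    if intF == 0.0:
        ip1 = 1
    elif intF == 1.0:
        ip1 = -1
    else:
        ip1 = np.where(Faki > target)[0][0]
    # position of the lower interpolation bound
    im1 = ip1-1
    # slope and intercept of the line through the two bounds
    dfak = Faki.iloc[ip1] - Faki.iloc[im1]
    dpsi = psi[ip1] - psi[im1]
    m = dfak/dpsi
    q = Faki.iloc[im1]-m*psi[im1]
    # invert the line at the target, then convert psi to 2**psi
    intPsi = (target-q)/m
    intDi  = 2**intPsi
    dxlist.append(intDi)
dfout['newcolumn'] = dxlist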

This works, but it is quite slow.

What I am missing is how to do the linear interpolation row-wise without a loop, using the positions found in each row to look up values in an external array.

brodegon
  • Have you tried using .apply method? – Joao Donasolo Jan 24 '22 at 18:51
  • the `.iterrows` method is a really inefficient way of working with a dataframe, especially as it gets larger and larger. there are some better suggestions [here](https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas/55557758#55557758) – Derek O Jan 24 '22 at 19:25
  • @JoaoDonasolo sure, but how? I mean I need to find an index over each row, apply it to an array and then calculate a new value to append to a new column. How can one wrap these three operations in one apply command? – brodegon Jan 25 '22 at 11:20
  • @DerekO sure, I'll have a look at that, but I think vectorization is not an option in this case. – brodegon Jan 25 '22 at 11:20

1 Answer

Apparently I found a vectorized solution:

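# dataframe with the same shape as Fak in which every row holds psi
psidf = Fak.copy()
psidf.loc[Fak.index] = psi
# masks of the values at or above / at or below the target fraction
target = intF/100
above = Fak.ge(target)
below = Fak.le(target)
# first value at or above the target and last value at or below it, per row
Fakp1 = Fak[above].bfill(axis=1).iloc[:,0]
Fakm1 = Fak[below].ffill(axis=1).iloc[:,-1]
# matching psi values, selected with the same masks
psip1 = psidf[above].bfill(axis=1).iloc[:,0]
psim1 = psidf[below].ffill(axis=1).iloc[:,-1]
# slope and intercept of the line through the two bracketing points
m     = (Fakp1-Fakm1)/(psip1-psim1)
q     = Fakm1-m*psim1
# invert the line at the target and convert psi to 2**psi
intDi_series = 2**((target-q)/m)
intDi['d'+str(int(intF))+nsfx] = intDi_series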

The key is to generate a dataframe with the same shape as Fak in which every row holds the array psi (this is what the first two statements of the code above do).
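As a minimal sketch of that step on its own, here is an equivalent and more explicit way to build such a dataframe with np.tile. The toy values of Fak and psi are made up for illustration, and the sketch assumes psi has exactly one entry per column of Fak.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for Fak and psi (hypothetical values, 2 rows x 3 columns)
Fak = pd.DataFrame(
    [[0.1, 0.5, 1.0],
     [0.2, 0.6, 1.0]],
    columns=["beta0", "beta1", "beta2"],
)
psi = [-3.0, -2.0, -1.0]

# Repeat psi once per row of Fak and reuse Fak's index and column labels,
# so that boolean masks computed on Fak can be applied directly to psidf
psidf = pd.DataFrame(
    np.tile(psi, (len(Fak), 1)),
    index=Fak.index,
    columns=Fak.columns,
)

print(psidf)
```

Because psidf shares its index and columns with Fak, a mask such as Fak.ge(0.3) lines up with it element by element.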

Then I mask Fak with the ge and le methods of pandas dataframes to isolate, in each row, the two values that bracket the target, and I apply the same masks to the newly generated dataframe to pick the matching psi values.
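For comparison, the same computation can probably also be written in plain NumPy without building a second dataframe. The following is only a sketch under stated assumptions: the function name interp_psi and the toy data are made up, each row of Fak is assumed to be increasing, the target is assumed to lie strictly between the first and last value of every row, and psi is assumed to have one entry per column. It uses a strict > comparison, like the loop in the question.

```python
import numpy as np
import pandas as pd

def interp_psi(Fak, psi, target):
    """Row-wise inverse linear interpolation (hypothetical helper).

    Returns, for each row of Fak, the psi at which the row would reach target.
    """
    F = Fak.to_numpy(dtype=float)
    psi = np.asarray(psi, dtype=float)
    # position of the first column strictly greater than target, per row
    ip1 = (F > target).argmax(axis=1)
    im1 = ip1 - 1
    rows = np.arange(F.shape[0])
    # bracketing values in Fak and the psi entries at the same positions
    F_hi, F_lo = F[rows, ip1], F[rows, im1]
    psi_hi, psi_lo = psi[ip1], psi[im1]
    # slope and intercept of the line through the two bracketing points
    m = (F_hi - F_lo) / (psi_hi - psi_lo)
    q = F_lo - m * psi_lo
    return pd.Series((target - q) / m, index=Fak.index)

# Toy data (hypothetical values)
Fak = pd.DataFrame(
    [[0.0, 0.50, 1.0],
     [0.0, 0.25, 1.0]],
    columns=["beta0", "beta1", "beta2"],
)
psi = [-3.0, -2.0, -1.0]

intPsi = interp_psi(Fak, psi, target=0.25)
intDi_series = 2 ** intPsi  # same final transformation as above
print(intPsi.tolist())
```

The fancy indexing F[rows, ip1] plays the role of the masks: it picks one element per row at the position found by argmax.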

brodegon