
I have a DataFrame with the locations of some customers (a Customer_id column plus Lat and Lon columns), and I am trying to interpolate the NaNs separately for each customer.

For example, if I interpolate with the nearest approach here (the values are made up):

 Customer_id   Lat    Lon
   A            1      1
   A            NaN    NaN  
   A            2      2      
   B            NaN    NaN
   B            4      4

I would like the NaN for B to be 4 and not 2.

I have tried this

series.groupby('Customer_id').apply(lambda group: group.interpolate(method = 'nearest', limit_direction = 'both'))

And the number of NaNs goes down from 9003 to 94, but I don't understand why it still leaves some missing values.

I checked, and these 94 missing values correspond to records from customers that were already being interpolated. For example,

    Customer_id   Lat
 0     A           1
 1     A           NaN
 2     A           NaN
 3     A           NaN
 4     A           NaN

It interpolates correctly up to some point (say rows 1, 2 and 3 get filled) and then leaves row 4 as NaN.

I have tried setting a limit in interpolate greater than the maximum number of records per customer, but it still doesn't work. I don't know where my mistake is; can somebody help out?

(I don't know if it's relevant, but I fabricated my own NaNs for this, using the code from Replace some values in a dataframe with NaN's if the index of the row does not exist in another dataframe. I don't think the problem is there, but since I'm very confused about where the issue actually is, I'll mention it anyway.)
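For what it's worth, the behaviour is easy to reproduce on a tiny made-up Series (this is just a sketch of the symptom, not my real data; method='nearest' needs scipy installed):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 2.0, np.nan])
out = s.interpolate(method='nearest', limit_direction='both')
print(out)
# the interior NaN gets filled, but the leading and trailing NaNs remain
```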

    Sorry, I'll change it in the question. I set it as a variable so that it was easier to change in the pipeline every time I run, but it was set to nearest. – Nocas Jul 08 '19 at 19:31

1 Answer


When you interpolate with nearest, it can only fill missing values that lie *between* non-null values. (You'll notice this because you get an error when a group has only one non-null value, as in your example.) The remaining null values are "edges" of each group, which are taken care of with .bfill().ffill() to match the nearest logic. That is also the appropriate logic for "interpolating" a group with only one non-missing value.

def my_interp(x):
    # nearest interpolation needs at least two non-null points
    if x.notnull().sum() > 1:
        # fill interior NaNs by nearest, then the edges by propagation
        return x.interpolate(method='nearest').ffill().bfill()
    else:
        # with a single non-null value, just propagate it both ways
        return x.ffill().bfill()

df.groupby('Customer_id').transform(my_interp)

#   Lat  Lon
#0  1.0  1.0
#1  1.0  1.0
#2  2.0  2.0
#3  4.0  4.0
#4  4.0  4.0
  • Is this the case only for the nearest method? Cubic also gives me trouble; should I try to do something similar with that approach too? – Nocas Jul 08 '19 at 23:36
  • Typically interpolate is used to fill values in between. Outside values require you to extrapolate, which is typically done through curve fitting. – ALollz Jul 08 '19 at 23:54
  • @Nocas see https://stackoverflow.com/questions/22491628/extrapolate-values-in-pandas-dataframe – ALollz Jul 09 '19 at 00:04
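To make the curve-fitting suggestion concrete, here is a minimal sketch (a made-up series, not the asker's data) that fills trailing NaNs by fitting a degree-1 polynomial with numpy.polyfit and evaluating it over the full index:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, np.nan, np.nan])

# fit a straight line to the known points only
known = s.dropna()
coeffs = np.polyfit(known.index.values, known.values, deg=1)

# evaluate the fitted line over the whole index and use it to fill the gaps
fitted = pd.Series(np.polyval(coeffs, s.index.values), index=s.index)
filled = s.fillna(fitted)
print(filled.tolist())  # approximately [1.0, 2.0, 3.0, 4.0, 5.0]
```

A higher degree (or scipy.optimize.curve_fit with a custom model) would follow the same pattern; for Lat/Lon you'd apply it per column within each groupby group.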