
I have a dataset of about 2 million rows, consisting of various properties at specific latitudes and longitudes. For each property, I have a valuation and a floor area. The valuations are complete but not all properties have floor areas.

I want to interpolate with some nearest-neighbours method to approximate the missing NaN values in the table. My software is written in Python, so this will probably involve NumPy, Pandas, SciPy, or some combination of them.

I've had a look at SciPy's cKDTree, and at calculating distances with the Haversine formula, but all the examples I've seen interpolate across a plane rather than gap-fill missing data, and I'm a bit at a loss as to how to achieve this.

As an example, here are the first few rows of what I've been using as test data (ratio is simply value/area):

lat       | long      | value | area  | ratio
----------|-----------|-------|-------|----------
57.101474 | -2.242851 | 12850 | 252.0 | 50.992063
57.102554 | -2.246308 | 14700 | 309.0 | 47.572816
57.100556 | -2.248342 | 25600 | 507.0 | 50.493097
57.101765 | -2.254688 | 28000 | 491.0 | 57.026477
57.097553 | -2.245483 | 5650  | 119.0 | 47.478992
57.098244 | -2.245768 | 43000 | 811.0 | 53.020962
57.098554 | -2.252504 | 46300 | 850.0 | 54.470588
57.102794 | -2.243454 | 7850  | 180.0 | 43.611111
57.101474 | -2.242851 | 26250 | NaN   | NaN
57.101893 | -2.239883 | 31000 | NaN   | NaN
57.101383 | -2.238955 | 28750 | NaN   | NaN
57.104578 | -2.235641 | 18500 | 327.0 | 56.574924
57.105424 | -2.234953 | 21950 | 406.0 | 54.064039
57.105516 | -2.233683 | 19600 | 408.0 | 48.039216

The properties themselves can be further grouped to get better relationships (this isn't part of the test data, but each property can be used for a different purpose, e.g. office, factory, shop).

I realise I can loop through this slowly, grouping properties by distance apart (testing each NaN property against the rest), but that seems like it would be heartbreakingly glacial.
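
To illustrate the sort of thing I mean, this is roughly what I have in mind with cKDTree, treating lat/long as if they were planar coordinates (an untested sketch, which presumably needs the Haversine correction mentioned above to get the distances right):

from scipy.spatial import cKDTree

known = df[df["area"].notnull()]
missing = df[df["area"].isnull()]

# Build the tree on properties with a known area, then query the
# single nearest valued neighbour (k=1) for every gap at once
tree = cKDTree(known[["lat", "long"]].values)
_, idx = tree.query(missing[["lat", "long"]].values, k=1)

# Borrow the neighbour's ratio and back the missing area out of value/ratio
nearest_ratio = known["ratio"].values[idx]
df.loc[missing.index, "ratio"] = nearest_ratio
df.loc[missing.index, "area"] = missing["value"].values / nearest_ratio

If that's roughly the right idea, it could presumably also be run per property type with df.groupby() once that grouping is added.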

df.to_clipboard() output (first 14 rows):

    lat         long        value   area    ratio
0   57.101474   -2.242851   12850   252.0   50.992063
1   57.102554   -2.246308   14700   309.0   47.572816
2   57.100556   -2.248342   25600   507.0   50.493097
3   57.101765   -2.254688   28000   491.0   57.026477
4   57.097553   -2.245483   5650    119.0   47.478992
5   57.098244   -2.245768   43000   811.0   53.020962
6   57.098554   -2.252504   46300   850.0   54.470588
7   57.102794   -2.243454   7850    180.0   43.611111
8   57.101474   -2.242851   26250   NaN     NaN
9   57.101893   -2.239883   31000   NaN     NaN
10  57.101383   -2.238955   28750   NaN     NaN
11  57.104578   -2.235641   18500   327.0   56.574924
12  57.105424   -2.234953   21950   406.0   54.064039
13  57.105516   -2.233683   19600   408.0   48.039216
  • Can you post your example using df.to_clipboard() instead? It may be cleaner. Also, do you want everything from scratch, or are you open to using libraries? – Tbaki Sep 21 '17 at 13:09
  • Happy to use libraries @Tbaki, and I've edited the question to paste in the df.to_clipboard() result. Not sure if that helps though? – Turukawa Sep 21 '17 at 13:45
  • It's just easier to copy and paste using df.read_clipboard(), for example. I put some code together; hope it helps – Tbaki Sep 21 '17 at 13:51
  • If you use lat/lon, you should build your own distance matrix, using the Haversine formula for example, but the logic is there – Tbaki Sep 21 '17 at 13:57
  • OK @Tbaki, I'm assuming the weird output is happening because of using only one coordinate? – Turukawa Sep 21 '17 at 14:10
  • We use both coordinates, but because of the curvature of the earth the Euclidean distance is only the straight line, so it's not as accurate; the Haversine formula accounts for that. – Tbaki Sep 21 '17 at 14:14

1 Answer


If you are open to libraries, you can use a distance matrix.

Assuming df is your main dataframe:

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

def find_closest(x, df):
    # Exclude the point itself from its own row of distances
    d = x.drop(x.name).to_dict()
    # Sort the remaining indices by distance
    v = sorted(d, key=lambda k: d[k])
    # Return the ratio of the closest point that has a non-NaN area, else NaN
    for i in v:
        if i in df[~df.area.isnull()].index:
            return df.loc[i].ratio
    return np.nan

# Pairwise distance matrix between every pair of points
df_matrix_distance = pd.DataFrame(euclidean_distances(df[["lat", "long"]]))
# Rows whose area is null
df_nan = df[df.area.isnull()]
# For each such row, the ratio of its nearest valued neighbour
res = df_matrix_distance.loc[df_nan.index].apply(lambda x: find_closest(x, df), axis=1).to_dict()
# Fill in the ratio, then back the area out of value/ratio
for k, v in res.items():
    df.loc[k, "ratio"] = v
    df.loc[k, "area"] = df.loc[k, "value"] / df.loc[k, "ratio"]

The result:

    lat         long        value   area    ratio
0   57.101474   -2.242851   12850   252.0   50.992063
1   57.102554   -2.246308   14700   309.0   47.572816
2   57.100556   -2.248342   25600   507.0   50.493097
3   57.101765   -2.254688   28000   491.0   57.026477
4   57.097553   -2.245483   5650    119.0   47.478992
5   57.098244   -2.245768   43000   811.0   53.020962
6   57.098554   -2.252504   46300   850.0   54.470588
7   57.102794   -2.243454   7850    180.0   43.611111
8   57.101474   -2.242851   26250   514.0   50.99206349
9   57.101893   -2.239883   31000   607.0   51.00502513
10  57.101383   -2.238955   28750   563.0   51.00502513
11  57.104578   -2.235641   18500   327.0   56.574924
12  57.105424   -2.234953   21950   406.0   54.064039
13  57.105516   -2.233683   19600   408.0   48.039216
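
As the comments note, Euclidean distance on raw lat/long ignores the curvature of the earth. A Haversine distance matrix could be dropped in instead; here is a minimal NumPy sketch (not part of the original answer: haversine_matrix is a hypothetical helper, and the 6371 km mean earth radius is an assumption):

import numpy as np

def haversine_matrix(lat, lon, radius_km=6371.0):
    # Pairwise great-circle distances in km between all points
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat[:, None]) * np.cos(lat[None, :]) * np.sin(dlon / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Drop-in replacement for the euclidean_distances call above:
# df_matrix_distance = pd.DataFrame(haversine_matrix(df["lat"], df["long"]))

Note that any dense n×n matrix over the 2 million rows mentioned in the question would need tens of terabytes of memory, so at that scale a tree-based nearest-neighbour query (e.g. the cKDTree mentioned in the question) is more practical than a full distance matrix.
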
  • OK, it's a cool approach, however, I ran it and checked it against the original data (since I deleted a few fields for the test): ,lat,long,value,area_calc,area_NaN,area_original 8,57.101474,-2.242851,26250,252,,477 9,57.101893,-2.239883,31000,398,,633 10,57.101383,-2.238955,28750,398,,575 11,57.103205,-2.239743,28350,27,,525 The variance is quite large: 0.892857143 0.590452261 0.444723618 18.44444444 – Turukawa Sep 21 '17 at 14:02
  • @Turukawa Do you think it's because of the Euclidean distance? – Tbaki Sep 21 '17 at 14:06
  • it may be ... I'm still trying to understand what it's doing, though. ... df_matrix_distance creates a grid of the distances between any two points from the original df, then df_nan is a subset of df with only the NaNs ... then you index that and return the nearest area value by distance, then turn that into a value for the NaN? – Turukawa Sep 21 '17 at 14:22
  • just looking at find_closest ... where you use 'area', should actually be 'ratio' ... then the last two lines should be `df.loc[k,"ratio"] = v df.loc[k,"area"] = df.loc[k,"value"]/ df.loc[k,"ratio"]` and then it works ... – Turukawa Sep 21 '17 at 14:29
  • the trick is that the ratios give us the linear relationship (tying to the continuous values) to use for the extrapolation ... this is actually awesome ... thanks :) – Turukawa Sep 21 '17 at 14:31
  • @Turukawa glad you found your solution. :D – Tbaki Sep 21 '17 at 14:51