I have a dataset of about 2 million rows, consisting of various properties at specific latitudes and longitudes. For each property, I have a valuation and a floor area. The valuations are complete but not all properties have floor areas.
I want to interpolate using some nearest neighbours method to approximate for the specific missing NaN
values in the table. My software is written in Python, so probably requires using Numpy, Pandas, Scipy or some combination.
I've had a look at using SciPy's cKDTree, as well as some distance approximation using a Haversine formula to calculate distances, but all the examples I've seen are about interpolating across a plane rather than for gap-filling missing data, and I'm a bit at a loss as to how to achieve this.
As an example, here's the first few rows of what I've been using as test data (ratio is simply value/area
):
lat | long | value | area | ratio
----------|-----------|-------|-------|----------
57.101474 | -2.242851 | 12850 | 252.0 | 50.992063
57.102554 | -2.246308 | 14700 | 309.0 | 47.572816
57.100556 | -2.248342 | 25600 | 507.0 | 50.493097
57.101765 | -2.254688 | 28000 | 491.0 | 57.026477
57.097553 | -2.245483 | 5650 | 119.0 | 47.478992
57.098244 | -2.245768 | 43000 | 811.0 | 53.020962
57.098554 | -2.252504 | 46300 | 850.0 | 54.470588
57.102794 | -2.243454 | 7850 | 180.0 | 43.611111
57.101474 | -2.242851 | 26250 | NaN | NaN
57.101893 | -2.239883 | 31000 | NaN | NaN
57.101383 | -2.238955 | 28750 | NaN | NaN
57.104578 | -2.235641 | 18500 | 327.0 | 56.574924
57.105424 | -2.234953 | 21950 | 406.0 | 54.064039
57.105516 | -2.233683 | 19600 | 408.0 | 48.039216
The properties themselves can be further grouped to get better relationships (this isn't part of the test data, but each property can be used for a different purpose, e.g. office, factory, shop).
I realise I can loop through this slowly, getting groups of properties by distance apart (testing each NaN
property against the rest) but that would seem to be heartbreakingly glacial.
df.to_clipboard()
output (first 15 rows):
lat long value area ratio
0 57.101474 -2.242851 12850 252.0 50.992063
1 57.102554 -2.246308 14700 309.0 47.572816
2 57.100556 -2.248342 25600 507.0 50.493097
3 57.101765 -2.254688 28000 491.0 57.026477
4 57.097553 -2.245483 5650 119.0 47.478992
5 57.098244 -2.245768 43000 811.0 53.020962
6 57.098554 -2.252504 46300 850.0 54.470588
7 57.102794 -2.243454 7850 180.0 43.611111
8 57.101474 -2.242851 26250 NaN NaN
9 57.101893 -2.239883 31000 NaN NaN
10 57.101383 -2.238955 28750 NaN NaN
11 57.104578 -2.235641 18500 327.0 56.574924
12 57.105424 -2.234953 21950 406.0 54.064039
13 57.105516 -2.233683 19600 408.0 48.039216