From https://stackoverflow.com/a/35684975/4533188 I gathered that K-Nearest Neighbour imputation works like this:

  1. For the current observation get the distance to all the other observations.
  2. For each missing value in the current observation, consider all those k nearest observations that have no missing value in the feature in question.
  3. Calculate the mean (or a similar statistic) of those observations' values for that feature; this is the value used for the imputation.

The key step is 1: how do we calculate the distance if not all values are available? The post above points towards the Heterogeneous Euclidean-Overlap Metric. However, I am interested in the KNN imputation implemented in fancyimpute. I tracked it back to https://github.com/hammerlab/knnimpute, more specifically https://github.com/hammerlab/knnimpute/blob/master/knnimpute/few_observed_entries.py, and looked at the code, but I am not able to figure out how it works.

Can someone please explain to me how knnimpute works there? How does the distance calculation work?

Make42
  • Most commonly Euclidean distance, but other choices are available. – alexwhitworth Mar 05 '17 at 01:17
  • @AlexW: Euclidean distance of what exactly? Of the data observation in question and the respective other data observations but only taking those features that are available in both observations - thus with changing features per observation comparison? Where is that in the code? – Make42 Mar 06 '17 at 06:57
  • [alexwhitworth/imputation](https://github.com/alexWhitworth/imputation/blob/master/src/dist_calcs.cpp) – alexwhitworth Mar 06 '17 at 15:37

1 Answer

What follows is specific to the KNNImputer class from the scikit-learn Python library. Doc: https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html

The parameter "metric" has "nan_euclidean" as its default value. The documentation can be found here: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.nan_euclidean_distances.html

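Intuitively, the "nan_euclidean" distance computes the standard Euclidean distance over the features that are present in both observations (any feature missing in either observation is skipped), and scales the squared distance linearly, by the ratio of the total number of features to the number of features actually used, to compensate for the missing entries.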
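To make this concrete, here is a minimal pure-Python sketch of that distance together with a simple imputation step built on it. The helper names `nan_euclidean` and `knn_impute_row` are hypothetical and chosen for illustration; this is not scikit-learn's code. It assumes uniform weighting of neighbours and treats pairs with no shared features as having an undefined distance.

```python
import math


def nan_euclidean(a, b):
    """Euclidean distance over the coordinates present in both a and b.

    The squared distance is rescaled by (total coordinates / present
    coordinates) before taking the square root.
    """
    pairs = [(x, y) for x, y in zip(a, b)
             if not (math.isnan(x) or math.isnan(y))]
    if not pairs:
        return math.nan  # no shared coordinates: distance is undefined
    sq_dist = sum((x - y) ** 2 for x, y in pairs)
    weight = len(a) / len(pairs)
    return math.sqrt(weight * sq_dist)


def knn_impute_row(row, others, k=2):
    """Fill each missing entry of `row` with the mean of that feature over
    the k nearest rows in `others` that have the feature observed."""
    filled = list(row)
    for j, value in enumerate(row):
        if not math.isnan(value):
            continue
        # Donors: rows with feature j observed and a defined distance to `row`.
        donors = []
        for other in others:
            d = nan_euclidean(row, other)
            if not math.isnan(other[j]) and not math.isnan(d):
                donors.append((d, other[j]))
        donors.sort(key=lambda pair: pair[0])
        nearest = [v for _, v in donors[:k]]
        if nearest:
            filled[j] = sum(nearest) / len(nearest)
    return filled


nan = math.nan

# Only the first and last coordinates are shared:
# sqrt(4/2 * ((3-1)**2 + (6-5)**2))
d = nan_euclidean([3.0, nan, nan, 6.0], [1.0, nan, 4.0, 5.0])
print(d)

data = [
    [1.0, 2.0, nan],
    [3.0, 4.0, 3.0],
    [nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
]
imputed = knn_impute_row(data[0], data[1:], k=2)
print(imputed)  # the missing third feature becomes the mean of 3.0 and 5.0
```

Note that the set of features entering the distance changes from one pair of observations to the next, which is why the rescaling is needed to keep distances comparable.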

Florian Lalande