How does knnimpute work?

Question

From https://stackoverflow.com/a/35684975/4533188 I got that K-Nearest Neighbour Imputation works like this:

1. For the current observation, get the distance to all the other observations.
2. For each missing value in the current observation, consider the k nearest observations that have no missing value in the feature in question.
3. From the feature values of those observations, calculate the mean (or some similar statistic). This is the value used for the imputation.

The key step is 1: How do we calculate the distance if not all values are available? The post above points towards the Heterogeneous Euclidean-Overlap Metric. However I am interested in the implementation of knn-imputation of fancyimpute. I tracked it back to https://github.com/hammerlab/knnimpute, more specifically https://github.com/hammerlab/knnimpute/blob/master/knnimpute/few_observed_entries.py and I looked at the code. However I am not able to figure out how it works.

Can someone please explain to me how knnimpute works there? How does the distance calculation work here?

Most commonly Euclidean distance, but other choices are available. — alexwhitworth, Mar 05 '17 at 01:17
@AlexW: Euclidean distance of what exactly? Of the data observation in question and the respective other data observations but only taking those features that are available in both observations - thus with changing features per observation comparison? Where is that in the code? — Make42, Mar 06 '17 at 06:57
[alexwhitworth/imputation](https://github.com/alexWhitworth/imputation/blob/master/src/dist_calcs.cpp) — alexwhitworth, Mar 06 '17 at 15:37

Florian Lalande · Accepted Answer · 2021-12-16T05:24:02.850

What follows is specific to the KNNImputer class from the scikit-learn Python library. Doc: https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html

The parameter "metric" has "nan_euclidean" as its default value. The documentation can be found here: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.nan_euclidean_distances.html

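Intuitively, the "nan_euclidean" distance computes the standard Euclidean distance over the coordinates that are present in both observations (any coordinate missing in either observation is skipped), and scales the squared distance up by the ratio of the total number of coordinates to the number of coordinates actually used, to compensate for the missing entries.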
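To make this concrete, here is a minimal sketch, assuming scikit-learn 0.22 or later (the release that introduced KNNImputer). The helper `nan_euclidean_manual` is a hypothetical name written only for illustration; it is not part of scikit-learn and is not how the library implements the metric internally. It recomputes the distance by hand so it can be compared against `nan_euclidean_distances`. The small matrices are made-up toy data.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

nan = np.nan

# Two toy observations with missing entries.
X = np.array([
    [3.0, nan, nan, 6.0],
    [1.0, nan, 4.0, 5.0],
])


def nan_euclidean_manual(x, y):
    """Hypothetical helper: hand-rolled version of the nan_euclidean idea."""
    # Use only coordinates that are present in BOTH observations.
    both = ~np.isnan(x) & ~np.isnan(y)
    if not both.any():
        return np.nan  # no shared coordinates, distance is undefined
    sq_dist = np.sum((x[both] - y[both]) ** 2)
    # Scale the squared distance: total coordinates / coordinates used.
    weight = x.size / both.sum()
    return np.sqrt(weight * sq_dist)


d_sklearn = nan_euclidean_distances(X)[0, 1]
d_manual = nan_euclidean_manual(X[0], X[1])
# Shared coordinates are columns 0 and 3:
# sqrt(4/2 * ((3-1)^2 + (6-5)^2)) = sqrt(10), about 3.1623
print(d_sklearn, d_manual)

# Imputation with the default metric ("nan_euclidean") and k = 2.
X_missing = np.array([
    [1.0, 2.0, nan],
    [3.0, 4.0, 3.0],
    [nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X_missing)
print(X_imputed)
```

In the second part, the missing value in row 0 is filled with the mean of column 2 over its two nearest rows that have that column observed (rows 1 and 2, giving 4.0), which matches the three steps listed in the question.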
