
I would like to understand how the R kknn package calculates weights, distances, and class probabilities for binary classification problems. In the R code below, there are three observations in the training sample and one observation in the holdout sample. The two predictor variables are height and weight. With Euclidean distance, the distances between the holdout observation and each observation in the training sample are:

sqrt((6-8)^2 + (4-5)^2) = 2.24

sqrt((6-3)^2 + (4-7)^2) = 4.24

sqrt((6-7)^2 + (4-3)^2) = 1.41.

With k=3 and with equal weights, I get a probability for the holdout as:

(1/3 * 1) + (1/3 * 0) + (1/3 * 1) = 0.67.

With k=2 and with equal weights, I get a probability for the holdout as:

(1/2 * 1) + (1/2 * 1) = 1.00.
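
These hand calculations are easy to verify with base R arithmetic (independent of kknn):

d <- c(sqrt((6 - 8)^2 + (4 - 5)^2),  # 2.24
       sqrt((6 - 3)^2 + (4 - 7)^2),  # 4.24
       sqrt((6 - 7)^2 + (4 - 3)^2))  # 1.41

mean(c(1, 0, 1))  # k = 3, equal weights: 0.67
mean(c(1, 1))     # k = 2, the two nearest neighbors both have class 1: 1.00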

I would like to understand how the R kknn package makes these same calculations with the "triangular," "gaussian," and "inverse" weights (and more generally).

library(kknn)

training <- data.frame(class = c(1, 0, 1), height = c(8, 3, 7), weight = c(5, 7, 3))

holdouts <- data.frame(class = 1, height = 6, weight = 4)

triangular_kernel <- kknn(class ~., training, holdouts, distance = 2, kernel = "triangular", k = 3)

triangular_kernel[["fitted.values"]]

triangular_kernel[["W"]]

triangular_kernel[["D"]]

gaussian_kernel <- kknn(class ~., training, holdouts, distance = 2, kernel = "gaussian", k = 3)

gaussian_kernel[["fitted.values"]]

gaussian_kernel[["W"]]

gaussian_kernel[["D"]]

inverse_kernel <- kknn(class ~., training, holdouts, distance = 2, kernel = "inv", k = 3)

inverse_kernel[["fitted.values"]]

inverse_kernel[["W"]]

inverse_kernel[["D"]]
    Looks like the package is an implementation of paper written by the same author. On google I found a pdf version of the paper here (no idea if it is a legit copy or not): https://epub.ub.uni-muenchen.de/1769/1/paper_399.pdf. I don't know the answer to your question but can imagine you can find it within the paper. – gowerc Jan 10 '21 at 16:43
    distance-weighted k-Nearest Neighbors (DWKNN) – Indranil Gayen Jan 18 '21 at 06:35

1 Answer


Entering kknn::kknn (without parentheses) in the R console prints the source code of the kknn function. With it, one can go through the function line by line to see what it does.

Distance

kknn calls compiled C code, dmEuclid. To obtain its source code, we follow this guide, running the following code in R:

untar(download.packages(pkgs = "kknn", destdir = ".", type = "source")[,2])

and then open the src directory of kknn_1.3.1.tar in your working directory (getwd()) to find dm.C, which can be opened with any text editor. Scroll about halfway down to find dmEuclid. To test the exact outputs of dmEuclid, you could install the build tools, open a C++ file in RStudio by selecting it in the dropdown menu, and run the code with different inputs.

Following the function's outputs, dmtmp$dm in your case contains

3.779645e-01  1.133893e+00 1.000000e+150 3.685210e-156

Per your specification of k, the first 3 values are taken as the distances D. The (k + 1)-th value becomes maxdist, which the package author manually floors at 1e-06; since the (k + 1)-th value is smaller than that in your case, maxdist = 1e-06.
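
As a sketch of that step (my paraphrase of the package logic, not a verbatim quote), with dm holding the k + 1 values returned by dmEuclid:

dm <- c(3.779645e-01, 1.133893e+00, 1.000000e+150, 3.685210e-156)
k <- 3
D <- dm[1:k]                    # the first k values become the distances D
maxdist <- dm[k + 1]            # the (k + 1)-th value becomes maxdist
maxdist <- max(maxdist, 1e-06)  # floored at 1e-06, as described above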

Weights

The kknn function uses the following section to compute the weights for your chosen kernel.

 W <- D/maxdist
 W <- pmin(W, 1 - (1e-06))
 W <- pmax(W, 1e-06)

At this point your W values (D/maxdist, i.e. D/1e-06) are far larger than 1, so pmin coerces each of them to 1 - 1e-06, approximately 1.

if (kernel == "inv" 
        W <- 1/W
if (kernel == "triangular") 
        W <- 1 - W
if (kernel == "gaussian") {
        alpha = 1/(2 * (k + 1))
        qua = abs(qnorm(alpha))
        W = W * qua
        W = dnorm(W, sd = 1)
    }

The explanation for these kernels can be found in the paper linked by gowerc. W is then converted to a matrix, W <- matrix(W, p, k), with 1 row (p = 1) and 3 columns (k = 3).
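
To make the effect concrete with your numbers (my own back-of-the-envelope check, not output copied from the package): since every entry of W was capped at 1 - 1e-06, the three kernels reduce to

W <- rep(1 - 1e-06, 3)     # the three capped weights

W_inv <- 1/W               # inverse: three weights just above 1
W_tri <- 1 - W             # triangular: three tiny but equal weights of 1e-06

alpha <- 1/(2 * (3 + 1))   # gaussian, with k = 3: alpha = 0.125
qua <- abs(qnorm(alpha))   # about 1.15
W_gau <- dnorm(W * qua)    # about 0.206 for each neighbor

so in this degenerate example each kernel assigns (nearly) equal weight to all three neighbors.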

Fitted value

In your case p = 1, k = 3, and cl = c(1, 0, 1).

C <- matrix(dmtmp$cl, nrow = p, ncol = k + 1)
C <- C[, 1:k] + 1
CL <- matrix(cl[C], nrow = p, ncol = k)
W <- matrix(W, p, k)
fit <- rowSums(W * CL)/pmax(rowSums(W), 1e-06)
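
Plugging your numbers in by hand (my arithmetic, using the capped triangular weights from above):

W <- matrix(rep(1e-06, 3), nrow = 1)     # triangular weights after capping
CL <- matrix(c(1, 0, 1), nrow = 1)       # classes of the 3 nearest neighbors
rowSums(W * CL)/pmax(rowSums(W), 1e-06)  # 0.6666667

which matches your unweighted k = 3 estimate of 0.67; with the weights forced to be equal, every kernel gives the same fitted value here.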
  • The write-up states, "To put equal weight on each covariate in computing the distances, one has to standardize the values." Standardization means dividing the variables by their standard deviation. I assume that these distances are what are given by dmtmp$dm. But when I calculate the standardized distances I get: d(x, x(1)) = sqrt(((6/sd(c(8,3,7))) - (8/sd(c(8,3,7))))^2 + ((4/sd(c(5,7,3))) - (5/sd(c(5,7,3))))^2) and d(x, x(2)) = sqrt(((6/sd(c(8,3,7))) - (3/sd(c(8,3,7))))^2 + ((4/sd(c(5,7,3))) - (7/sd(c(5,7,3))))^2), and so on. These do not match with dmtmp$dm, which is the first step. – user1491868 Jan 13 '21 at 17:51
  • From the article -- "The standardization of all kinds of covariates is only based on the observations from the learning set." `kknn` constructs `learn` through several steps and makes a vector of sds called `d.sd`, also through some steps. Then `sweep(learn, 2L, d.sd, "/", check.margin = FALSE)` is the line of code that divides the values by `d.sd`. This all happens __before__ `dmtmp` is computed, as the `dmEuclid` function takes `learn` as an input variable. – Donald Seinen Jan 13 '21 at 19:01
  • Thank you for your help and patience. Your Answer is the best thus far. But what I'm really hoping for is a direct answer to the following question: given the numbers in my example, please show me using math (not complicated computer code) how the kknn function gets its numbers. – user1491868 Jan 13 '21 at 19:38