2

[EDIT: I understand that it is faster also because the function is written in C, but I want to know if It does a brute force search on all the training instances or something more sophisticated ]

I'm implementing in R, for studying purpose, the KNN algorithm. I'm also checking the code correctness by comparison with the caret implementation.

The problem lies on the execution time of the two versions. My version seems to take a lot of time, instead the caret implementation is very fast (even with crossvalidation with 10 folds).

Why? I'm calculating every euclidean distance of my test instances from the training ones. Which means that I'm doing NxM distance calculation (where N are my test instances, and M my training instances):

for (i in 1:nrow(test)){
    distances <- c()
    classes <- c()
    for(j in 1:nrow(training)){
        d = calculateDistance(test[i,], training[j,])
        distances <- c(distances, d)
        classes <- c(classes, training[j,][[15]])
    }
}

Is the caret implementation using some approximate search? Or an exact search, for example with the kd-tree? How can I speed up the search? I got 14 features for the problem, but I've been reading that the kd-tree is suggested for problem with 1 to 5 features.

EDIT:

I've found the C function called by R (VR_knn), which is pretty complex for me to understand, maybe someone can help.

Anyway I've written on the fly a brute force search in cpp, which seems to go faster than my previous R version, (but not fast as the caret C version) :

#include <Rcpp.h>
using namespace Rcpp;

double distance(NumericVector x1, NumericVector x2){
     int vectorLen = x1.size();
     double sum = 0;
     for(int i=0;i<vectorLen-1;i++){
         sum = sum + pow((x1.operator()(i)-x2.operator()(i)),2);
     }
     return sqrt(sum);

}

// [[Rcpp::export]]
void searchCpp(NumericMatrix training, NumericMatrix test) {
    int numRowTr = training.rows();
    int numColTr = training.cols();
    int numRowTe = test.rows();
    int numColTe = test.cols();
    for (int i=0;i<numRowTe;i++)
    {
        NumericVector test_i = test.row(i);
        NumericVector distances = NumericVector(numRowTe);
        for (int j=0;j<numRowTr;j++){
            NumericVector train_j = training.row(j);
            double dist = distance(test_i, train_j);
            distances.insert(i,dist);
        }
    }
}
Nikaido
  • 4,443
  • 5
  • 30
  • 47
  • Have you looked to see what the caret knn code is doing? – jmuhlenkamp Dec 03 '17 at 14:45
  • @jmuhlenkamp I'm searching for the code, but I have problem to find it (I'm new to R and caret) – Nikaido Dec 03 '17 at 14:50
  • 1
    Try `print(class::knn)` or `View(class::knn)` (if in Rstudio) – jmuhlenkamp Dec 03 '17 at 15:04
  • as I can see from the code, there is a calling to a .c function, which I don't know what is doing .C(VR_knn, as.integer(k), as.integer(l), as.integer(ntr), as.integer(nte), as.integer(p), as.double(train), as.integer(unclass(clf)), as.double(test), res = integer(nte), pr = double(nte), integer(nc + 1), as.integer(nc), as.integer(FALSE), as.integer(use.all)) – Nikaido Dec 03 '17 at 15:12
  • Yes, there is likely the basic answer to your question of why it runs much much faster (even if you are doing the same thing). Re-implementing it exactly as it is done in C with R code would likely run much much slower. What is your overall goal? – jmuhlenkamp Dec 03 '17 at 15:15
  • If you're still interesting in getting to the underlying C code, this answer may help: https://stackoverflow.com/a/19226817/6850554 – jmuhlenkamp Dec 03 '17 at 15:17
  • so basically vr_knn is doing a basic MxN search? My question, as is stated from the post, is asking what method is using the knn implementation written in R/C. If is working as an approximation search or if is using a kd-tree search, or something else. – Nikaido Dec 03 '17 at 15:17
  • which is pretty long and complex for me to read, and there isn't a good documentation about the code – Nikaido Dec 03 '17 at 15:42
  • If you want to use KD-trees from R try the RANN package. – gdkrmr Dec 03 '17 at 16:58
  • @gdkrmr I've seen that library, but I'd like to use it if needed and if it works with #features > 5 – Nikaido Dec 03 '17 at 17:14

0 Answers0