
I'm implementing the K-nearest neighbours classification algorithm in C# for a training and testing set of about 20,000 samples each, and 25 dimensions.

There are only two classes, represented by '0' and '1' in my implementation. For now, I have the following simple implementation:

// testSamples and trainSamples consist of about 20k vectors each, with 25 dimensions per vector
// trainClasses contains 0 or 1 signifying the corresponding class for each sample in trainSamples
static int[] TestKnnCase(IList<double[]> trainSamples, IList<double[]> testSamples, IList<int> trainClasses, int K)
{
    Console.WriteLine("Performing KNN with K = "+K);

    var testResults = new int[testSamples.Count()]; 

    var testNumber = testSamples.Count();
    var trainNumber = trainSamples.Count();
    // Declaring these here so that I don't have to 'new' them over and over again in the main loop, 
    // just to save some overhead
    var distances = new double[trainNumber][]; 
    for (var i = 0; i < trainNumber; i++)
    {
       distances[i] = new double[2]; // Will store both distance and index in here
    }

    // Performing KNN ...
    for (var tst = 0; tst < testNumber; tst++)
    {
        // For every test sample, calculate distance from every training sample
        Parallel.For(0, trainNumber, trn =>
        {
            var dist = GetDistance(testSamples[tst], trainSamples[trn]);
            // Storing distance as well as index 
            distances[trn][0] = dist;
            distances[trn][1] = trn;
        });

        // Sort distances and take the top K (what happens in the case of multiple points at the same distance?)
        var votingDistances = distances.AsParallel().OrderBy(t => t[0]).Take(K);

        // Do a 'majority vote' to classify test sample
        var yea = 0.0;
        var nay = 0.0;

        foreach (var voter in votingDistances)
        {
            if (trainClasses[(int)voter[1]] == 1)  
               yea++;
            else
               nay++;
        }
        if (yea > nay)
            testResults[tst] = 1;
        else
            testResults[tst] = 0;

    }

    return testResults;
}

// Calculates and returns square of Euclidean distance between two vectors
static double GetDistance(IList<double> sample1, IList<double> sample2)
{
    var distance = 0.0;
    // assume sample1 and sample2 are valid i.e. same length 

    for (var i = 0; i < sample1.Count; i++)
    {   
        var temp = sample1[i] - sample2[i];
        distance += temp * temp;
    }
    return distance;
}

This takes quite a bit of time to execute. On my system it takes about 80 seconds to complete. How can I optimize this while ensuring that it will also scale to a larger number of data samples? As you can see, I've tried using PLINQ and parallel for loops, which did help (without these, it was taking about 120 seconds). What else can I do?

I've read about KD-trees being efficient for KNN in general, but every source I read stated that they're not efficient for higher dimensions.

I also found this Stack Overflow discussion of the problem, but it is three years old, and I was hoping someone would know of better solutions by now.

I've looked at machine learning libraries in C#, but for various reasons I don't want to call R or C code from my C# program, and some other libraries I saw were no more efficient than the code I've written. Now I'm just trying to figure out how I could write the most optimized code for this myself.

Edited to add: I cannot reduce the number of dimensions using PCA or anything similar. For this particular model, all 25 dimensions are required.

ubuntunoob
  • It seems that your code currently works, and you are looking to improve it. Generally these questions are too opinionated for this site, but you might find better luck at [CodeReview.SE](http://codereview.stackexchange.com/tour). Remember to read [their requirements](http://codereview.stackexchange.com/help/on-topic) as they are a bit more strict than this site. – gunr2171 Jul 07 '14 at 17:40
  • I didn't know about that, thanks @gunr2171, I'll try there too. However, I still think it is a valid question for this site as well, because I was hoping to get a discussion on perhaps using a different data structure (like KD-trees) for this problem, as in the stackoverflow post I've linked. – ubuntunoob Jul 07 '14 at 17:45
  • http://programmers.stackexchange.com may be better. Looking for alternative algorithms is borderline "too broad" for SO. Check out related questions - sometimes the solution is already there for some other language. – Alexei Levenkov Jul 07 '14 at 17:51
  • Will try that too @AlexeiLevenkov, thanks. I'm still searching for a good up-to-date discussion about this. – ubuntunoob Jul 07 '14 at 17:54
  • possible duplicate of [Nearest neighbors in high-dimensional data?](http://stackoverflow.com/questions/5751114/nearest-neighbors-in-high-dimensional-data) – Matthew Finlay Jul 07 '14 at 23:31
  • @MatthewFinlay thanks for linking that, I have seen it. The discussion on that question is three years old, so I was hoping someone had gained more insight since then. Also, that question focuses on solutions that are not relevant to me, like dimensionality reduction or approximation. – ubuntunoob Jul 09 '14 at 14:46
  • For the record, I got a bunch of useful suggestions [here](http://codereview.stackexchange.com/questions/56367/how-to-optimize-k-nearest-neighbours-in-c-for-large-number-of-dimensions) – ubuntunoob Jul 17 '14 at 19:32
  • I am currently working on a C# module to optimize K-nearest neighbor searches in high-dimensional problems (10 to 1000 dimensions). I am having excellent success using Hilbert curves. For K=50 neighbors, 200 dimensions, 10,000 points, I get a 40-times speedup over the linear scan. Map the n-D point to a 1-D Hilbert index, perform a binary search, then sort the smaller list using the distance function. See this article: J. Shepherd, X. Zhu, and N. Megiddo, “A Fast Indexing Method for Multidimensional Nearest Neighbor Search”. – Paul Chernoch Oct 23 '14 at 17:08
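The approach described in that comment can be sketched roughly as follows. This is only an illustration of the filter-and-refine pattern: it substitutes a plain projection onto a fixed direction for the 1-D sort key, whereas a real Hilbert index (as in the comment and the cited paper) preserves locality in high dimensions far better. All names here (Dot, NearestByKeyWindow, direction, windowSize) are invented for the sketch and are not from the question; GetDistance is the question's squared-distance function.

using System;
using System.Linq;

// One-time setup (direction can be any fixed vector, e.g. randomly generated):
//   double[] keys = trainSamples.Select(s => Dot(s, direction)).ToArray();
//   int[] orderByKey = Enumerable.Range(0, keys.Length).OrderBy(i => keys[i]).ToArray();
//   double[] sortedKeys = orderByKey.Select(i => keys[i]).ToArray();

static double Dot(double[] a, double[] b)
{
    var sum = 0.0;
    for (var i = 0; i < a.Length; i++)
        sum += a[i] * b[i];
    return sum;
}

// Per query: binary-search the query's key into the sorted key list, take a window of
// candidates around that position, and rank only that small window by exact distance.
static int[] NearestByKeyWindow(double[][] trainSamples, double[] sortedKeys, int[] orderByKey,
                                double[] testSample, double[] direction, int K, int windowSize)
{
    var pos = Array.BinarySearch(sortedKeys, Dot(testSample, direction));
    if (pos < 0) pos = ~pos; // BinarySearch returns the complement of the insertion point when the key is absent

    var start = Math.Max(0, pos - windowSize / 2);
    var count = Math.Min(sortedKeys.Length - start, windowSize);

    return Enumerable.Range(start, count)
        .Select(j => orderByKey[j])                                   // map back to original sample indices
        .OrderBy(idx => GetDistance(testSample, trainSamples[idx]))   // exact distance, but only on the window
        .Take(K)
        .ToArray();
}

With a Hilbert key the window tends to contain the true neighbours; with a plain projection it is only a crude approximation, so treat this as the shape of the approach rather than a finished solution.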

1 Answer


Whenever you are attempting to improve the performance of code, the first step is to analyze the current performance to see exactly where it is spending its time. A good profiler is crucial for this. In my previous job I was able to use the dotTrace profiler to good effect; Visual Studio also has a built-in profiler. A good profiler will tell you exactly where your code is spending its time, method by method or even line by line.

That being said, a few things come to mind in reading your implementation:

  1. You are parallelizing the inner loop over training samples. Could you parallelize the outer loop over test samples instead? There is a small but nonzero cost associated with a delegate call (see here or here), which may be hitting you in the "Parallel.For" callback; see the first sketch after this list.

  2. Similarly, there is a small performance penalty for indexing into an array through its IList interface. You might consider declaring the parameters of "GetDistance()" explicitly as double[] arrays, as in the first sketch after this list.

  3. How large is K compared to the size of the training array? You are completely sorting the "distances" array and taking the top K, but if K is much smaller than the array size it might make sense to use a partial sort / selection algorithm, for instance by keeping a SortedSet of the K best candidates and evicting the largest element whenever the set size exceeds K (see the second sketch below).
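As a rough illustration of points 1 and 2 together, here is a sketch that moves the parallelism to the outer loop over test samples (each iteration then needs its own distance buffer, so the shared-buffer trick from the question no longer applies) and passes double[] to GetDistance directly. It assumes the samples are available as double[][] rather than IList<double[]>; it is a sketch under those assumptions, not a drop-in replacement for the question's method.

using System.Linq;
using System.Threading.Tasks;

static int[] TestKnnCaseOuterParallel(double[][] trainSamples, double[][] testSamples, int[] trainClasses, int K)
{
    var testResults = new int[testSamples.Length];

    // Point 1: parallelize the outer loop, so the delegate cost is paid once per
    // test sample rather than once per training sample.
    Parallel.For(0, testSamples.Length, tst =>
    {
        // Each iteration owns its own buffer; nothing is shared between threads.
        var distances = new double[trainSamples.Length][];
        for (var trn = 0; trn < trainSamples.Length; trn++)
            distances[trn] = new[] { GetDistance(testSamples[tst], trainSamples[trn]), (double)trn };

        var yea = 0;
        var nay = 0;
        foreach (var voter in distances.OrderBy(d => d[0]).Take(K))
        {
            if (trainClasses[(int)voter[1]] == 1) yea++; else nay++;
        }
        testResults[tst] = yea > nay ? 1 : 0;
    });

    return testResults;
}

// Point 2: take double[] directly so the arrays are indexed without going
// through the IList<double> interface.
static double GetDistance(double[] sample1, double[] sample2)
{
    var distance = 0.0;
    for (var i = 0; i < sample1.Length; i++)
    {
        var diff = sample1[i] - sample2[i];
        distance += diff * diff;
    }
    return distance;
}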
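And a minimal sketch of point 3, assuming K is much smaller than the number of training samples (the question mentions K = 10): keep a SortedSet of the K best candidates seen so far, keyed on (distance, index) so that ties in distance do not collide, and evict the current farthest whenever a closer candidate appears. DistanceComparer and TakeKNearest are names made up for this sketch.

using System.Collections.Generic;

// Orders { distance, index } pairs by distance, breaking ties on index so that
// two candidates at the same distance remain distinct elements in the set.
class DistanceComparer : IComparer<double[]>
{
    public int Compare(double[] a, double[] b)
    {
        var byDistance = a[0].CompareTo(b[0]);
        return byDistance != 0 ? byDistance : a[1].CompareTo(b[1]);
    }
}

// Returns up to K { squaredDistance, trainIndex } pairs without sorting all the distances.
static IEnumerable<double[]> TakeKNearest(double[][] trainSamples, double[] testSample, int K)
{
    var best = new SortedSet<double[]>(new DistanceComparer());
    for (var trn = 0; trn < trainSamples.Length; trn++)
    {
        var candidate = new[] { GetDistance(testSample, trainSamples[trn]), (double)trn };
        if (best.Count < K)
        {
            best.Add(candidate);
        }
        else if (candidate[0] < best.Max[0])
        {
            best.Remove(best.Max); // evict the farthest of the K kept so far
            best.Add(candidate);
        }
    }
    return best;
}

The voting loop from the question can then iterate over the returned pairs exactly as before.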

dbc
  • Thanks for the suggestions @dbc, I did use the Visual Studio profiler. It showed me that 61% of the runtime is being spent in the GetDistance() function. I also tried changing the Parallel.For loop to include the code of the GetDistance() function instead of a call to the function, which saved me a few seconds at the cost of readability. K is 10, which is quite small, so I will try your other suggestions also. – ubuntunoob Jul 09 '14 at 14:44