2

EDIT: I am trying to model a dataset via kNN (caret package) classifier in r, but it runs for a very long time. eventually I am stopping it. Sometimes whwn I stop it, it says "use warnings() to see all warning messages". When I do that, it says "lots of ties" for each column in the data set. I found a few solutions for this problem in here but none of them is worked in my situation. They say "put some pseudo-random noise data into data set, and it will work". I tried it, and didn't work:

https://stats.stackexchange.com/questions/25926/dealing-with-lots-of-ties-in-knn-model

END EDIT.

Thats why I am giving my train dataset's link to you guys, so maybe one of you can understand why kNN stucks when modeling it:

http://www.htmldersleri.org/train.csv (It is well-know Reuters-21578 dataset)

And here is the kNN r line:

knn<-train(as.factor(class)~.,data=as.matrix(train),method="kNN")

or

knn<-train(as.factor(class)~.,data=train,method="kNN")

none of them is working.

By the way, instead of kNN, using svmLinear does not work either.

And an Important note: I applied unique() function on all the columns and I noticed that there isn't any column that has only one value. They all are varies.

Lastly, here is the dataset information part in my project report, might be useful:

In Reuters-21578 dataset, we used top ten classes; 7269 samples in training set and 2686 samples in test set. The distribution of the classes is unbalanced. The maximum class has 2899 documents, occupying 39,88% of training set. The minimum class has 113 documents, occupying 1,55% of training set. Table I shows the ten most frequent categories along with the number of training and test set examples in each.

Dataset

Community
  • 1
  • 1
Whcrs
  • 61
  • 1
  • 8
  • 2
    `does not work` - what does not work? Are you getting an error message or does it just run for a very long time? – TayTay May 06 '16 at 13:01
  • @Tgsmith61591 actuall yes. it runs for a very long time. – Whcrs May 06 '16 at 13:55
  • @jogo I edited my question. can u check again please? – Whcrs May 06 '16 at 15:54
  • caret can handle ties... it's not that. Clustering techniques like this are extremely computationally intensive. How long is it running? – TayTay May 06 '16 at 20:01

1 Answers1

1

Why kNN and SVM are slow on this dataset

Both k-Nearest Neighbours (kNN) and Support Vector Machines (SVM, which you mention later in your question) grow in complexity significantly as the sample size (n) increases. Broadly, you can think of this as quadratic O(n^2) growth rate in Big O notation, although the truth is more nuanced than that.

In the case of kNN, the reason is quite easy to understand: it requires a train-n x test-n distance matrix to be constructed. What this means is that a kNN or SVM on a dataset of 1,415 examples will likely take twice as long as the same operation on a dataset of 1,000 examples, because time is a function of train-n x test-n.

So there is an upper limit on what size of dataset these algorithms can handle in-memory, which is considerably lower than the limit for, say, a logistic regression or gradient boosting machine.

Improving speed

Splitting your data into groups will give you quadratic performance improvements. For example, splitting an n=2000 dataset randomly into 2 n=1000 datasets, will give you a 4-fold performance improvement.

Dealing with ties

More importantly, though, it looks like you are attempting to train your model on categorical input data. kNN is designed to work with numeric input data. Categorical variables must be recoded as dummy variables. You aren't getting an error message, because the categorical variables are stored as numeric codes, but that doesn't mean the numbers are mathematically meaningful for kNN's distance calculations.

You are getting lots of ties because your dataset contains many categorical variables encoded as integers, with relatively few possible values. You could handle this in a couple of ways:

  1. Run a correspondence analysis on the categorical variables, and then run your kNN on the values returned by the correspondence analysis (which are continuous and orthogonal). The FactoMineR library has a well-documented function MCA for multiple correspondence analysis.

  2. Convert each factor to dummy variables (as explained in the link above) and then use a distance metric which is better suited to sparse data - cosine similarity. Unfortunately, you cannot run kNN with cosine similarity from caret, but it is possible to implement manually, as explained in this SO thread.

Community
  • 1
  • 1
ajrwhite
  • 7,728
  • 1
  • 11
  • 24