I am trying to predict values for a categorical variable using a KNN model in R.

To do this, I am using a function so that I can easily vary the dataset, % of observations, and k-value.

When I apply this function to a particular dataset though, I am getting an error.

EDIT: I am limited in how reproducible I can make this question, but I have added the libraries to make clear which packages I am using.

The data I am using is structured like this:

library(dplyr)
library(class)
library(neuralnet)
library(nnet)
library(lubridate)

> head(crypto_data)
                 time btc_price eth_price block_size difficulty estimated_btc_sent estimated_transaction_volume_usd  hash_rate
1 2017-09-02 21:54:00  1.622181  1.710355  0.9502574  -1.258379        -0.05186039                        0.4346130 -0.7265456
2 2017-09-02 22:29:00  1.738889  1.970749  0.5771003  -1.258379        -0.07004424                        0.4110978 -1.0477347
3 2017-09-02 23:04:00  1.705891  1.938885  0.4726202  -1.258379        -0.10641195                        0.3755673 -0.9406717
4 2017-09-02 23:39:00  1.775354  2.159321  0.4144439  -1.258379        -0.14277966                        0.3348643 -0.8871402
5 2017-09-03 00:14:00  2.028195  2.572964  0.2132932  -1.258379        -0.10641195                        0.4305168 -1.0477347
6 2017-09-03 00:49:00  2.097871  2.504085  0.0190859  -1.258379        -0.14277966                        0.3756431 -1.1547978
  miners_revenue_btc miners_revenue_usd minutes_between_blocks n_blocks_mined n_blocks_total n_btc_mined        n_tx nextretarget
1          1.0287278           1.699011            -0.43408783     0.37556660      -2.016092  0.37464164  0.04072815     -2.22295
2          0.6856301           1.417137            -0.11622241     0.04004961      -2.015293  0.06154488 -0.12441993     -2.22295
3          0.7955973           1.507554            -0.22217755     0.15188860      -2.008898  0.15100110 -0.05626304     -2.22295
4          0.8395842           1.543490            -0.29923583     0.20780810      -2.005700  0.19572920 -0.10762521     -2.22295
5          0.6812315           1.519311            -0.06806098     0.04004961      -2.003302  0.06154488 -0.09733929     -2.22295
6          0.5580682           1.416853            -0.03916412    -0.07178939      -2.000904 -0.07263945 -0.19824250     -2.22295
  total_btc_sent total_fees_btc  totalbtc trade_volume_btc trade_volume_usd targetVar
1     -0.9319080       2.703601 -2.551107        0.2518994        0.5783353       buy
2     -0.9698475       2.632490 -2.551107        0.2518994        0.5783353       buy
3     -0.9698475       2.638365 -2.551107        0.2518994        0.5783353       buy
4     -1.0077870       2.594611 -2.551107        0.2518994        0.5783353       buy
5     -1.0077870       2.628309 -2.551107        0.1465798        0.4688573      hold
6     -1.0267568       2.568152 -2.551107        0.1465798        0.4688573      hold

The function is:

knn_pred_func <- function(inData, k, trainPct) {

  trainP <- trainPct * .6
  valP <- trainPct * .2
  testP <- trainPct * .2

  #SplitData
  trainObs <- sample(nrow(inData), trainP * nrow(inData), replace = FALSE)
  valObs <- sample(nrow(inData), valP * nrow(inData), replace = FALSE)
  testObs <- sample(nrow(inData), testP * nrow(inData), replace = FALSE)

  # Create the training/validation/test datasets
  trainDS <- inData[trainObs,]
  valDS <- inData[valObs,]
  testDS <- inData[testObs,]

  # Separate the labels
  train_labels <- trainDS[,"targetVar"]

  # KNN
  knn_crypto_val_pred <- knn(trainDS, valDS, train_labels, k = k)
  knn_crypto_test_pred <- knn(trainDS, testDS, train_labels, k = k)
}

When I call knn_pred_func(crypto_data, 3, 1) I get the following error:

Error in knn(trainDS, valDS, train_labels, k = k) :
  NA/NaN/Inf in foreign function call (arg 6)
In addition: Warning messages:
1: In knn(trainDS, valDS, train_labels, k = k) :
  NAs introduced by coercion
2: In knn(trainDS, valDS, train_labels, k = k) :
  NAs introduced by coercion

What does this mean and how can I fix it? I have tried several variations of knn_pred_func, and they all produce the same error. Also, I initially had separate label sets for train/val/test, but I kept only train_labels after reading an online post. Isn't this wrong? Shouldn't I be feeding each knn call the labels of the corresponding dataset?

zsad512
  • See how to share data in a [reproducible format](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) so it will be easier to help you. `knn()` isn't a built-in function, so what package are you using? – MrFlick Oct 25 '17 at 20:30
  • Hard to tell without being able to see your full dataset, but the issue might be that the datasets you're feeding into `knn` have missing value(s). You could easily check this. – jruf003 Oct 25 '17 at 21:07
  • @MrFlick please see my edits, I added the packages – zsad512 Oct 26 '17 at 00:03

1 Answer

I suspect that the problem lies in the date-time column of crypto_data. The "NAs introduced by coercion" warnings indicate that your input data frame contains values that knn() cannot convert to numbers. Please have a look at a very detailed answer to a similar question here: Error with knn function
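
As a rough illustration of what those warnings point to, here is a minimal sketch on a made-up three-row data frame (`toy` is hypothetical, not the asker's data). It assumes only base R and shows that a data frame mixing numeric columns with a date-time or label column collapses to a character matrix, and that turning non-numeric strings into numbers produces NA.

```r
# Toy data (hypothetical), mimicking the column types in crypto_data
toy <- data.frame(
  time      = as.POSIXct(c("2017-09-02 21:54:00", "2017-09-02 22:29:00",
                           "2017-09-02 23:04:00"), tz = "UTC"),
  btc_price = c(1.62, 1.74, 1.71),
  targetVar = c("buy", "buy", "hold"),
  stringsAsFactors = FALSE
)

# A matrix holds a single type, so mixed columns collapse to character
m <- as.matrix(toy)
is.character(m)

# Strings that are not numbers become NA (normally with a coercion warning)
label_as_num <- suppressWarnings(as.numeric(m[, "targetVar"]))
anyNA(label_as_num)

# Keeping only the numeric columns avoids the coercion problem
num_only <- toy[, sapply(toy, is.numeric), drop = FALSE]
names(num_only)
```

Note that the label column triggers the same coercion as the time column, so both need to be kept out of the feature matrix.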

Unless time is an important feature for your classification task, I suggest dropping it and calling:

knn_pred_func(crypto_data[,-1], 3, 1)
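
For a fuller picture, the following is one possible sketch of a split-and-predict helper, not the asker's function. The name `knn_split_predict`, the `label_col` and `props` arguments, and the simulated `toy` data are all assumptions made for illustration, and the original `trainPct` argument is left out for brevity. It shuffles the rows once so the three sets cannot overlap, builds the feature matrix from numeric columns only, and passes `knn()` the training labels as `cl`.

```r
library(class)

# Hypothetical helper: disjoint train/val/test split, numeric predictors only.
# Assumes the label column is character or factor (so it is not numeric).
knn_split_predict <- function(inData, k, label_col = "targetVar",
                              props = c(train = 0.6, val = 0.2)) {
  n <- nrow(inData)
  idx <- sample(n)                           # one shuffle, so sets cannot overlap
  n_train <- floor(props[["train"]] * n)
  n_val   <- floor(props[["val"]] * n)
  train_i <- idx[seq_len(n_train)]
  val_i   <- idx[n_train + seq_len(n_val)]
  test_i  <- idx[-seq_len(n_train + n_val)]  # whatever rows remain

  labels <- inData[[label_col]]              # plain vector, also for tibbles
  is_num <- vapply(inData, is.numeric, logical(1))
  X <- as.matrix(inData[, is_num, drop = FALSE])  # drops time and the label

  train_cl <- factor(labels[train_i])
  list(
    val_pred  = knn(X[train_i, , drop = FALSE], X[val_i, , drop = FALSE],
                    cl = train_cl, k = k),
    val_true  = labels[val_i],
    test_pred = knn(X[train_i, , drop = FALSE], X[test_i, , drop = FALSE],
                    cl = train_cl, k = k),
    test_true = labels[test_i]
  )
}

# Simulated data (hypothetical) with the same kinds of columns as crypto_data
set.seed(1)
toy <- data.frame(
  time = seq(as.POSIXct("2017-09-02 21:54:00", tz = "UTC"),
             by = "35 min", length.out = 50),
  x1 = rnorm(50),
  x2 = rnorm(50),
  targetVar = sample(c("buy", "hold", "sell"), 50, replace = TRUE)
)

res <- knn_split_predict(toy, k = 3)

# Validation accuracy: held-out labels are used for scoring, not passed to knn()
mean(as.character(res$val_pred) == as.character(res$val_true))
```

In this sketch only the training labels go into `knn()`. The validation and test labels are kept aside and compared with the predictions afterwards.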
apitsch
  • I tried your suggestion, I still end up with the following error: `Error in knn(train = select(trainDS, -time, -targetVar), test = select(valDS, : 'train' and 'class' have different lengths` – zsad512 Oct 26 '17 at 13:58
  • You should also remove the label column (`targetVar`) from `trainDS`, `testDS` and `valDS` when you call `knn()`. – apitsch Oct 26 '17 at 17:31
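
One possible cause of the `'train' and 'class' have different lengths` error, offered as a guess since the revised code is not shown: `knn()` compares `length(cl)` with the number of training rows. If the labels are still a one-column data frame rather than a vector, that length is 1. Tibbles, which dplyr works with, keep the data-frame shape even for `[ , "targetVar"]`. The sketch below uses a made-up `df` and base R only.

```r
# Hypothetical three-row example
df <- data.frame(x = c(0.1, 0.4, 0.9),
                 targetVar = c("buy", "buy", "hold"))

# Single brackets keep the data-frame shape: a list holding one column
as_frame <- df["targetVar"]
length(as_frame)                # counts columns, so this is 1

# Double brackets (or $) extract the column itself as a vector
as_vector <- df[["targetVar"]]
length(as_vector) == nrow(df)   # one label per row, as knn() expects
```

If this is the cause, extracting the labels with `trainDS[["targetVar"]]` or `trainDS$targetVar` should give a vector of the right length.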