I'm new to machine learning and have successfully built a KNN classifier. Now I want to implement n-fold cross-validation, but it is taking too long to run in R. Is there a more efficient way of doing this?

Below is my code (it has been running for 30 minutes now ...):

require(class)
set.seed(2095)
#dataset source:https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
normalize<-function(x){
  return ((x - min(x)) / (max(x)-min(x)))
}
#duplicate rows removed; every label other than "normal." relabelled as "attack"
dataset <- read.csv("data/kdd_data_10pc_cleansed_removed_dup.csv",header=FALSE,sep=",")
names <- read.csv("data/kdd_names.csv",header=FALSE,sep=";")
names(dataset) <- sapply((1:nrow(names)),function(i) toString(names[i,1]))

#extracting relevant features
dataset_extracted<-dataset[,c("duration", "src_bytes","dest_bytes","land","wrong_fragments","count", "diff_srv_rate", "dst_host_count", "dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_serror_rate", "under_attack")]

#shuffling of data randomly
rand_sorter = runif(nrow(dataset_extracted))
dataset_extracted <-dataset_extracted[order(rand_sorter),]

#normalizing columns 1-11 to the range 0 to 1
dataset_normalized <- as.data.frame(lapply(dataset_extracted[, 1:11], normalize))
folds <-cut(seq(1,nrow(dataset_normalized)), breaks=10, labels=FALSE)
each_k_error = NULL
for(j in 1:145586){
  avg_error = NULL
  for(i in 1:10){
    testIndexes <- which(folds==i, arr.ind=TRUE)
    testData <- dataset_normalized[testIndexes, ]
    trainData <- dataset_normalized[-testIndexes,]
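    #class labels come from the "under_attack" column (column 12); column 5 is the "wrong_fragments" feature
    tempM = knn(train = trainData, test = testData, cl = dataset_extracted[-testIndexes, "under_attack"], k = j)
    tempTestTarget <- dataset_extracted[testIndexes, "under_attack"]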
    tempTable = table(tempTestTarget, tempM)
    #diagonal of the row-percentage table: fraction of each class predicted correctly
    error_per_class = diag(prop.table(tempTable,1))
    avg_error <- c(avg_error, mean(error_per_class))
    #avg_error = avg_error + mean(error_per_class)
  }
  each_k_error <- c(each_k_error, list(j, mean(avg_error)))
}
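
For context on why the loop above is slow: `j` is passed as `k`, the number of neighbours, so the code runs a full 10-fold cross-validation for each of 145586 neighbour counts (about 1.46 million `knn()` calls). The number of folds and the number of neighbours are separate settings. A minimal sketch of 10-fold cross-validation over a small grid of `k` values follows. The simulated data, the column names `x1` and `x2`, and the grid `k_grid` are assumptions for illustration and are not taken from the KDD dataset.

```r
library(class)  # provides knn()

set.seed(2095)

# Simulated stand-in for the normalized features (assumption: not the real KDD data)
n <- 300
features <- data.frame(x1 = runif(n), x2 = runif(n))
labels <- factor(ifelse(features$x1 + features$x2 > 1, "attack", "normal"))

# Assign each row to one of 10 folds at random
folds <- sample(rep(1:10, length.out = n))

# Hypothetical small grid of neighbour counts, instead of 1:nrow(data)
k_grid <- c(1, 3, 5, 7, 9)

cv_accuracy <- sapply(k_grid, function(k) {
  fold_acc <- sapply(1:10, function(i) {
    test_idx <- which(folds == i)
    pred <- knn(train = features[-test_idx, ],
                test  = features[test_idx, ],
                cl    = labels[-test_idx],
                k     = k)
    mean(pred == labels[test_idx])  # fraction classified correctly in this fold
  })
  mean(fold_acc)  # average accuracy over the 10 folds
})

results <- data.frame(k = k_grid, accuracy = cv_accuracy)
print(results)
```

With this layout the cost is 10 `knn()` calls per candidate `k`, so the run time is governed by the size of the grid and not by the number of rows in the dataset.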
misctp asdas
  • That's the total number of rows in the dataset, so it will try each row in the dataset (as test data) against the rest as training data. – misctp asdas Nov 27 '16 at 06:46
  • @ChirayuChamoli I'm afraid I need to use KNN due to some constraints. I was wondering if k-fold libraries might help. It's arduously slow! – misctp asdas Nov 27 '16 at 06:52
  • @ChirayuChamoli Thanks for the reply. I'm new to machine learning; could you suggest modifications to my current code? – misctp asdas Nov 27 '16 at 07:03
  • @ChirayuChamoli I really appreciate you helping out. I also tried the steps at this link, http://cbio.mines-paristech.fr/~jvert/svn/tutorials/practical/knnregression/knnregression.R, and was able to adapt the code, but again it took extremely long ... – misctp asdas Nov 27 '16 at 07:10
  • No, the link demonstrates how to do PCA. My dataset comes from a different source; see my code for the source of the dataset. – misctp asdas Nov 27 '16 at 12:41
  • I'm talking about the KDD link. I have picked up *kddcup.data_10_percent.gz* and *kddcup.names*. The data seems to have 42 columns while the names file only has 22 labels in it. – Chirayu Chamoli Nov 27 '16 at 12:48
  • Yes, I only took out the relevant features, as shown in dataset_extracted_features. In addition, I added a new column, under_attack, which is true if the label is not "normal." and false otherwise. – misctp asdas Nov 27 '16 at 13:02
  • @ChirayuChamoli Sure, I uploaded them to Google Drive: https://drive.google.com/open?id=0B1PzIQnZX6AESkg2NzFKX0l2eFU (dataset) and https://drive.google.com/file/d/0B1PzIQnZX6AETjZ6OEtfelZEU0k (names) – misctp asdas Nov 27 '16 at 13:35
  • I think you should read [this](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). It would be better if you simulate a dataset. – Chirayu Chamoli Nov 27 '16 at 13:46
  • @ChirayuChamoli How do I simulate a dataset? This is the dataset I have, so what should I do to "simulate" it? – misctp asdas Nov 27 '16 at 13:49
  • I will give you a small example of creating a data set `df = data.frame(matrix(1:420, 42, 10), stringsAsFactors=F)`. – Chirayu Chamoli Nov 27 '16 at 13:58
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/129144/discussion-between-misctp-asdas-and-chirayu-chamoli). – misctp asdas Nov 27 '16 at 14:15
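
Testing each row against all the remaining rows, as described in the comments, is leave-one-out cross-validation. The `class` package that the question already loads includes `knn.cv()`, which does this in a single call for a fixed `k`. A minimal sketch on a small simulated dataset follows, in the spirit of the suggestion above to simulate data. The data, the column names `x1` and `x2`, and the choice `k = 5` are assumptions for illustration only, and the call may still take a while on the full 145586 rows.

```r
library(class)  # provides knn.cv()

set.seed(2095)

# Simulated data (assumption: stands in for the normalized features and labels)
n <- 200
features <- data.frame(x1 = runif(n), x2 = runif(n))
labels <- factor(ifelse(features$x1 > 0.5, "attack", "normal"))

# knn.cv() predicts every row from all the other rows (leave-one-out)
loo_pred <- knn.cv(train = features, cl = labels, k = 5)

# Fraction of rows whose left-out prediction matches the true label
loo_accuracy <- mean(loo_pred == labels)
print(loo_accuracy)
```

This replaces the explicit loop over rows with one function call, so no manual fold bookkeeping is needed for the leave-one-out case.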

0 Answers