I'm new to R, so please help me understand what is wrong. I'm trying to predict some data, but the object that the predict function returns (strangely, of class factor) contains too few values. The test set is 5886 obs. of 160 variables, while the length of the predicted object is 110. I expected to get back a vector of predicted classes or a data frame. What am I misunderstanding?

library(MASS)
library(e1071)
set.seed(333)

data <- read.csv(file="D:\\MaсhLearningAssign\\pml-training.csv", head=TRUE, sep=",")

index <- 1:nrow(data)
testindex <- sample(index, trunc(length(index)*30/100))
train <- data[-testindex, ]
test <- data[testindex, ]

model  <- svm(classe~., data = train, kernel="radial", gamma=0.001, cost=10)
prediction <- predict(model, test)
summary(prediction)



Output:
    A  B  C  D  E 
    28 24 25 12 22 

Dataset here

Has QUIT--Anony-Mousse
UndeadDragon
  • [How to make a great R reproducible example?](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – zero323 Dec 19 '14 at 22:19

1 Answer

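svm doesn't handle missing observations, and your data set is full of NAs. By default it drops rows that contain NAs (`na.action = na.omit`), both when fitting and when predicting, which is why you get far fewer predictions than test rows. Only 406 rows are complete: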

> dim(data[complete.cases(data), ])
[1] 406 160
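
To see how dropping incomplete rows shrinks a data set, here is a minimal base R sketch on a toy data frame. The column names `x1` and `x2` and all values are made up for illustration; only `classe` comes from the question.

```r
# Toy data frame (hypothetical values) with NAs scattered across columns
toy <- data.frame(
  x1 = c(1, NA, 3, 4, 5, 6),
  x2 = c(NA, 2, 3, NA, 5, 6),
  classe = factor(c("A", "B", "A", "B", "A", "B"))
)

# Number of rows that have no missing value in any column
n_complete <- sum(complete.cases(toy))

# na.omit() drops every row that contains at least one NA
kept <- na.omit(toy)

# NAs per column, handy for spotting mostly-empty columns
na_per_col <- colSums(is.na(toy))

print(n_complete)
print(nrow(kept))
print(na_per_col)
```

Half of the toy rows disappear even though each column is only partly missing, which is the same effect that leaves about a hundred predictions for a test set of several thousand rows.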

You can try removing the columns that contain NAs and then train svm:

> data <- data[, which(colSums(apply(data, 2, is.na)) == 0)]
> dim(data)
[1] 19622    93
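
The same column filter can be written a little more directly, since `is.na` works on a whole data frame. This is a sketch on a toy data frame with hypothetical column names, not the actual data set:

```r
# Toy data frame (hypothetical columns); two columns contain NAs
toy <- data.frame(
  roll = c(1, 2, 3, 4),
  avg_roll = c(NA, 2, NA, 4),
  user = c("x", "y", "x", "y"),
  var_roll = c(NA, NA, NA, 1)
)

# is.na() on a data frame returns a logical matrix,
# so colSums() counts the NAs in each column
keep <- colSums(is.na(toy)) == 0

# drop = FALSE keeps the result a data frame even if one column survives
toy_clean <- toy[, keep, drop = FALSE]

print(names(toy_clean))
```

Keep in mind that `is.na` only sees values that `read.csv` parsed as `NA`. Blank strings in text columns are not counted unless they are listed in the `na.strings` argument when reading the file.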

Now you can try to split your data and fit svm. I would be careful, though: it is still a pretty big data set, and svm is rather resource hungry.

Hint: I looked at your data, and if it is what I think it is, please be sure to read the data set description carefully. You have two completely different types of rows. That should not only explain the abundance of NAs, but also give you an idea of what will be useful for prediction given your test set.

zero323
  • Thank you! What can I use instead of SVM? – UndeadDragon Dec 19 '14 at 23:53
  • And how can I plot a ROC curve with that prediction object? – UndeadDragon Dec 20 '14 at 00:07
  • The first question is quite broad. Pretty much every technique you use will require some way of handling missing observations, and that is not an easy topic by itself. Even if you ignore this part, the choice of method really depends on your goals, and it is not really a programming problem. Regarding ROC, you can try https://rocr.bioinf.mpi-sb.mpg.de/, but a quick Google search will help you find other options. – zero323 Dec 20 '14 at 02:25
  • Sorry, one last question :) I don't want to create a new one. I tried to run prediction in the same way on a separate data set (from here https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv), but R says "Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels". I spent many hours but can't understand why the test split from the training data set works and this test set does not. NA problems again? But when I tried to drop the NAs from the test set too, R started to complain about missing columns or something like that. – UndeadDragon Dec 20 '14 at 03:10
  • Take a look at the `new_window` attribute. It has only one level. As for dropping NAs: you always want to preprocess your train and test data in the same way. So either keep the same columns in both or apply the same filling strategy. – zero323 Dec 20 '14 at 12:13
  • BTW, I've added a hint to the answer. I won't do your homework for you, but there is a clear strategy for dealing with your problems. – zero323 Dec 20 '14 at 12:18
  • Hi. new_window has only one level in both train and test, but if I pass part of the train set to the predict function, there are no errors, so I still don't understand. I tried "train <- train[ , which(names(train) %in% names(test))]", but that didn't help either. And this is not homework, it's a course on Coursera, so knowledge is what matters to me, but R errors are so unclear to me that I'm wasting too much time on them. – UndeadDragon Dec 20 '14 at 12:23
  • As I said above: "you have two completely different types of rows". Just figure it out. It is not a programming problem. – zero323 Dec 20 '14 at 13:06
  • I finally found it, thank you. I have no idea why they give such dirty data for the first exercise. – UndeadDragon Dec 20 '14 at 13:07
  • Sure. I would argue it is not dirty, and both parts can be useful, but there is a takeaway message here: whenever you get a new data set, try to figure out what is going on inside it, do a proper exploratory analysis, read the code books, descriptions and so on, because the methods you use usually don't care. If it wasn't for the NAs and factors, you wouldn't even notice there is something more here. This is a well-known data set. It's been used more than once, it's been published, and there is a description available. But most of the time you don't have this comfort. Good luck with the course. – zero323 Dec 20 '14 at 13:43
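
The advice in the comments about preprocessing train and test in the same way can be sketched in base R roughly as follows. The data frames, the column names `x`, `mostly_na` and `grp`, and all values are hypothetical; the idea is to pick the columns from the training data only and reuse that choice, along with the training factor levels, on the test data.

```r
# Hypothetical training data: one mostly-NA column and one factor predictor
train <- data.frame(
  x = c(1, 2, 3, 4),
  mostly_na = c(NA, NA, 3, NA),
  grp = factor(c("no", "yes", "no", "no")),
  classe = factor(c("A", "B", "A", "B"))
)

# Hypothetical test data: no outcome column, and grp shows a single level
test <- data.frame(
  x = c(5, 6),
  mostly_na = c(NA, NA),
  grp = factor(c("no", "no"))
)

# Decide which columns to keep by looking at the training data only
keep <- names(train)[colSums(is.na(train)) == 0]
predictors <- setdiff(keep, "classe")

train_clean <- train[, keep, drop = FALSE]
test_clean <- test[, predictors, drop = FALSE]

# Re-code factor predictors in test with the levels seen in training,
# so a factor that happens to show one value still carries all levels
for (col in predictors) {
  if (is.factor(train_clean[[col]])) {
    test_clean[[col]] <- factor(test_clean[[col]],
                                levels = levels(train_clean[[col]]))
  }
}

print(names(test_clean))
print(levels(test_clean$grp))
```

Choosing the columns from the training set and applying that same choice to the test set avoids the "missing columns" complaints that appear when NAs are dropped from each set independently.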