
I am trying to train an SVM for anomaly detection. For this, I created train_data and test_data using only the source IP and protocol columns. However, when I call the plot function, I get the error below:

    > plot(svmfit,testdat)
    Error in Summary.factor(c(7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,  : 
      min not meaningful for factors

How can I get rid of this error?

These are the commands, run from an external file:

    train_data=read.csv("packetcapture_training.csv")
    #read only source ip and protocol  
    xtrain=train_data[4:23,c(3,5)]
    ytrain=c(rep(-1,10),rep(1,10))
    dat=data.frame(x=xtrain,y=as.factor(ytrain))
    library("e1071")
    svmfit=svm(y~.,data=dat,kernel="radial",cost=10,scale=FALSE)
    summary(svmfit)
    test_data=read.csv("packetcapture_testing.csv")
    #read only source ip and protocol
    xtest=test_data[371:390,c(3,5)]
    ytest=c(rep(1,10),rep(-1,10))
    testdat=data.frame(x=xtest,y=as.factor(ytest))
    plot(svmfit,testdat)





    > dat
                       x.Source x.Protocol  y
    1  fe80::a00:27ff:feee:7ec6     ICMPv6 -1
    2  fe80::a00:27ff:feee:7ec6     ICMPv6 -1
    3  fe80::a00:27ff:feee:7ec6     ICMPv6 -1
    4               172.16.11.1        TCP -1
    5             192.168.2.101        TCP -1
    6               172.16.11.1        TCP -1
    7               172.16.11.1        TCP -1
    8               172.16.11.1        TCP -1
    9             192.168.2.101        TCP -1
    10            192.168.2.101        TCP -1
    11              172.16.11.1        TCP  1
    12              172.16.11.1        TCP  1
    13              172.16.11.1        TCP  1
    14            192.168.2.101        TCP  1
    15              172.16.11.1        TCP  1
    16            192.168.2.101        TCP  1
    17              172.16.11.1        TCP  1
    18              172.16.11.1        TCP  1
    19            192.168.2.101      SSHv2  1
    20              172.16.11.1        TCP  1

    > dput(head(dat,4))
    structure(list(x.Source = structure(c(6L, 6L, 6L, 1L), .Label = c("172.16.11.1", 
    "192.168.2.100", "192.168.2.101", "CadmusCo_8b:7b:80", "CadmusCo_ee:7e:c6", 
    "fe80::a00:27ff:feee:7ec6"), class = "factor"), x.Protocol = structure(c(5L, 
    5L, 5L, 7L), .Label = c("ARP", "DNS", "HTTP", "ICMP", "ICMPv6", 
    "SSHv2", "TCP", "UDP"), class = "factor"), y = structure(c(1L, 
    1L, 1L, 1L), .Label = c("-1", "1"), class = "factor")), .Names = c("x.Source", 
    "x.Protocol", "y"), row.names = c(NA, 4L), class = "data.frame")

    > testdat
             x.Source x.Protocol  y
    371   172.16.11.1        TCP  1
    372   172.16.11.1        TCP  1
    373   172.16.11.1        TCP  1
    374   172.16.11.1        TCP  1
    375   172.16.11.1        TCP  1
    376   172.16.11.1        TCP  1
    377   172.16.11.1        TCP  1
    378   172.16.11.1        TCP  1
    379   172.16.11.1        TCP  1
    380   172.16.11.1        TCP  1
    381   172.16.11.1        TCP -1
    382   172.16.11.1        TCP -1
    383   172.16.11.1        TCP -1
    384   172.16.11.1        TCP -1
    385   172.16.11.1        TCP -1
    386   172.16.11.1        TCP -1
    387   172.16.11.1        TCP -1
    388   172.16.11.1        TCP -1
    389 192.168.2.101      SSHv2 -1
    390 192.168.2.101     ICMPv6 -1


    > dput(head(testdat,4))
    structure(list(x.Source = structure(c(1L, 1L, 1L, 1L), .Label = c("172.16.11.1", 
    "192.168.2.100", "192.168.2.101", "CadmusCo_8b:7b:80", "CadmusCo_ee:7e:c6", 
    "fe80::a00:27ff:feee:7ec6"), class = "factor"), x.Protocol = structure(c(7L, 
    7L, 7L, 7L), .Label = c("ARP", "DNS", "HTTP", "ICMP", "ICMPv6", 
    "SSHv2", "TCP", "UDP"), class = "factor"), y = structure(c(2L, 
    2L, 2L, 2L), .Label = c("-1", "1"), class = "factor")), .Names = c("x.Source", 
    "x.Protocol", "y"), row.names = 371:374, class = "data.frame")
dudedev
  • Because you didn't include any data, this error is not reproducible. If you want to make it easier for people to help you, please see the guide on [How to make a great R reproducible example?](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – MrFlick May 18 '14 at 00:05
  • Sorry for the lack of information. I am not sure if this will be enough. I guess I am making a mistake with the formatting of the columns. How should I format the IP address for SVM learning? – dudedev May 19 '14 at 20:37

1 Answer


The plot.svm function in library("e1071") apparently only plots continuous predictors. Because your model uses two categorical predictors, you get that error. What kind of visualization were you expecting?
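As a minimal sketch of why the error appears, the snippet below builds a small stand-in for the question's data (the three rows are made up for illustration) and asks for the minimum of a factor column directly. It assumes only base R; the message in the question suggests that plot.svm makes a similar `min()` call on the predictor columns, though that is inferred from the error text rather than checked against the package source.

```r
# Toy stand-in for the question's data: both predictors are factors.
dat <- data.frame(
  x.Source   = factor(c("172.16.11.1", "192.168.2.101", "172.16.11.1")),
  x.Protocol = factor(c("TCP", "TCP", "SSHv2")),
  y          = factor(c(-1, 1, 1))
)

# Check which columns are factors.
is_fac <- sapply(dat, is.factor)
print(is_fac)

# Taking the minimum of an unordered factor raises an error;
# tryCatch() captures the message instead of stopping the script.
err_msg <- tryCatch(
  min(dat$x.Protocol),
  error = function(e) conditionMessage(e)
)
print(err_msg)
```

Running `sapply(dat, is.factor)` on your own `dat` and `testdat` is a quick way to confirm which predictors are being treated as categorical.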

The examples on the help page show:

    data(cats, package = "MASS")
    m <- svm(Sex~., data = cats)
    plot(m, cats)

There, the points can be spread out along a continuous range, and the decision boundary falls at a meaningful break point. Categorical predictors are not ordered, so there is no clear way to plot them in a similar way.
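For discrete predictors, one possible alternative to a scatter plot is a cross-tabulation of class against each factor level. The sketch below is only an illustration in base R: the `protocol` and `y` vectors are retyped from the `dat` printout in the question, and in practice the labels returned by `predict()` could be tabulated in place of `y`.

```r
# Protocol and class columns retyped from the `dat` printout in the question.
protocol <- factor(c(rep("ICMPv6", 3), rep("TCP", 7),
                     rep("TCP", 8), "SSHv2", "TCP"))
y <- factor(c(rep(-1, 10), rep(1, 10)))

# Count how many rows fall into each (protocol, class) combination.
tab <- table(protocol = protocol, class = y)
print(tab)
```

Passing `tab` to `mosaicplot()` from base graphics would draw one tile per combination, which may be closer to the "two clear classes" picture than anything plot.svm can produce for factors.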

MrFlick
  • Thanks for your answer. I just wanted it to plot the data points with their classification, so that I can see two clear classes on the plot. It treats the source IP as categorical data because I passed it in as a string. I know that should not be the case, and that is where the problem lies. I am not sure how I should input the IP address from the CSV file to create an SVM model. – dudedev May 21 '14 at 09:24
  • @user3648560 You passed everything as a `factor`. That seems like the right choice given your data. It's just more difficult to make plots for discrete data. There's no obvious way to "spread out" the data to see it. – MrFlick May 21 '14 at 12:33
  • OK, thank you. I split the source IP into four octets and passed them as numerical data. When I use the predict function, it gives the error "Test data does not match the model". All commands are the same; the CSV file now has 5 columns: 4 for the source IP octets and 1 for the protocol. – dudedev May 22 '14 at 08:20
  • @user3648560 The comments are not a good place to ask new questions. You may start a new post on this site if you are now having a different problem. – MrFlick May 22 '14 at 14:32
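The octet approach mentioned in the comments can be sketched in base R as follows. `split_ipv4` is a hypothetical helper written for this illustration, not part of e1071, and the column names `oct1` to `oct4` are arbitrary. Note that the question's data also contains IPv6 addresses and MAC-style names, which this sketch simply turns into NA rows. The last lines show one thing that may be worth checking for the predict error (an assumption, not a confirmed diagnosis): that a factor column in the test set carries the same levels as in the training set.

```r
# Hypothetical helper: split dotted-quad IPv4 strings into four integer columns.
# Values that are not IPv4 (IPv6, MAC-style names) become rows of NA.
split_ipv4 <- function(ip) {
  ip <- as.character(ip)
  is_v4 <- grepl("^[0-9]{1,3}(\\.[0-9]{1,3}){3}$", ip)
  out <- matrix(NA_integer_, nrow = length(ip), ncol = 4,
                dimnames = list(NULL, paste0("oct", 1:4)))
  if (any(is_v4)) {
    parts <- strsplit(ip[is_v4], ".", fixed = TRUE)
    out[is_v4, ] <- matrix(as.integer(unlist(parts)), ncol = 4, byrow = TRUE)
  }
  as.data.frame(out)
}

# Addresses taken from the question's printouts.
src <- c("172.16.11.1", "192.168.2.101", "fe80::a00:27ff:feee:7ec6")
octets <- split_ipv4(src)
print(octets)

# Give the test factor the same level set as the training factor.
train_proto <- factor(c("ICMPv6", "TCP", "SSHv2"))
test_proto  <- factor(c("TCP", "TCP"), levels = levels(train_proto))
print(levels(test_proto))
```

Whether numeric octets are a meaningful feature for an SVM is a separate modelling question; the sketch only shows the mechanics of the conversion.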