
How can I make a ksvm model aware that the first 100 numbers in a dataset are all time series data from one sensor, while the next 100 numbers are all time series data from another sensor, etc, for six separate time series sensor inputs? Alternatively (and perhaps more generally), how can I present two-dimensional input data to an SVM?

The process for which I need a binary yes/no prediction model has six non-periodic time series inputs, all with the same sampling frequency. An event triggers the start of data collection, and after a pre-determined time I need a yes/no prediction (preferably including a probability-of-correctness output). The characteristics of the time-series inputs which should produce 'yes' vs. 'no' are not known, but what is known is that there should be some correlation between each of the input time series and the final outcome. There is also significant noise present on all inputs. Both the meaningful information and the noise appear on the inputs as short-duration bursts (the meaningful bursts always occur at the same general time for a given input source), but identifying which bursts are meaningful and which are noise is difficult; i.e. the fact that a burst happened at the "right" time for one input does not necessarily indicate a "yes" output; it may just be noise. To know whether the prediction should be "yes", the model needs to somehow incorporate information from all six time series inputs. I have a collection of prior data with approximately 900 'no' results and 100 'yes' results.

I'm pretty new to both R and SVMs, but I think I want to use an SVM model (kernlab's ksvm). I'm having trouble figuring out how to present the input data to it. I'm also not sure how to tell ksvm that the data is time series data, or if that is even relevant. I've tried using the Rattle GUI front-end to R to pull in my data from csv files, but I can't figure out how to present the time series data from all six inputs to the ksvm model. As a csv-file input, it seems the only way to import the data for all 1000 samples is to organize the input data so that all sample data (for all six time series inputs) from one known-result file sits on a single line of the csv file, with one known-result file per line. But in doing so, the fact that the 1st, 2nd, 3rd, etc. numbers are each part of the time series data from the first sensor is lost in the translation, as is the fact that the 101st, 102nd, 103rd, etc. numbers are each part of the time series data from the second sensor, and so on; to the ksvm model, each data sample is just an isolated number unrelated to its neighbor. How can I present this data to ksvm as six separate but interrelated time series arrays? Or how can I present a 2-dimensional array of data to ksvm?


UPDATE:

OK, there are two basic strategies I've tried with dismal results (well, the resulting models were better than blind guessing, but not much).

First of all, not being familiar with R, I used the Rattle GUI front-end to R. I have a feeling that by doing so I may be limiting my options. But anyway, here's what I've done.

Example known result files (shown with only 4 sensors instead of 6, and only 7 time samples instead of 100):

training168_yes.csv

Seconds Since 1/1/2000,sensor1,sensor2,sensor3,sensor4
454768042.4,           0,      0,      0,      0
454768042.6,           51,     60,     0,      172
454768043.3,           0,      0,      0,      0
454768043.7,           300,    0,      0,      37
454768044.0,           0,      0,      1518,   0
454768044.3,           0,      0,      0,      0
454768044.7,           335,    0,      0,      4273

training169_no.csv

Seconds Since 1/1/2000,sensor1,sensor2,sensor3,sensor4
454767904.5,           0,      0,      0,      0
454767904.8,           51,     0,      498,    0
454767905.0,           633,    0,      204,    55
454767905.3,           0,      0,      0,      512
454767905.6,           202,    655,    739,    656
454767905.8,           0,      0,      0,      0
454767906.0,           0,      934,    0,      7814

The only way I know to get the data for all training samples into R/Rattle is to massage & combine all result files into a single .csv file, with one sample result per line. I can think of only two ways to do that, so I tried them both (and I knew when I was doing it that by doing this I'm hiding potentially important information, which is the point of this SO question):

TRIAL #1: For each result file, add each sensor's samples into a single number, blasting away all temporal information:

result,sensor1,sensor2,sensor3,sensor4
no,    886,    1589,   1441,   9037
yes,   686,    60,     1518,   4482
no,    632,    1289,   1173,   9152
yes,   411,    67,     988,    5030
no,    772,    1703,   1351,   9008
yes,   490,    70,     1348,   4909

When I get done using Rattle to generate the SVM, Rattle's log tab gives me the following script which can be used to generate & train an SVM in RGui:

library(rattle)
building <- TRUE
scoring  <- ! building
library(colorspace)
crv$seed <- 42 
crs$dataset <- read.csv("file:///C:/Users/mminich/Desktop/stackoverflow/trainingSummary1.csv", na.strings=c(".", "NA", "", "?"), strip.white=TRUE, encoding="UTF-8")
set.seed(crv$seed) 
crs$nobs <- nrow(crs$dataset) # 6 observations 
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.67*crs$nobs) # 4 observations
crs$validate <- NULL
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 2 observations
# The following variable selections have been noted.
crs$input <- c("sensor1", "sensor2", "sensor3", "sensor4")
crs$numeric <- c("sensor1", "sensor2", "sensor3", "sensor4")
crs$categoric <- NULL
crs$target  <- "result"
crs$risk    <- NULL
crs$ident   <- NULL
crs$ignore  <- NULL
crs$weights <- NULL
require(kernlab, quietly=TRUE)
set.seed(crv$seed)
crs$ksvm <- ksvm(as.factor(result) ~ .,
      data=crs$dataset[,c(crs$input, crs$target)],
      kernel="polydot",
      kpar=list("degree"=1),
      prob.model=TRUE)

TRIAL #2: For each result file, add the samples for all sensors for each time into a single number, blasting away any information about individual sensors:

result,time1, time2, time3, time4, time5, time6, time7
no,    0,     549,   892,   512,   2252,  0,     8748
yes,   0,     283,   0,     337,   1518,  0,     4608
no,    0,     555,   753,   518,   2501,  0,     8984
yes,   0,     278,   12,    349,   1438,  3,     4441
no,    0,     602,   901,   499,   2391,  0,     7989
yes,   0,     271,   3,     364,   1474,  1,     4599

And again I use Rattle to generate the SVM, and Rattle's log tab gives me the following script:

library(rattle)
building <- TRUE
scoring  <- ! building
library(colorspace)
crv$seed <- 42 
crs$dataset <- read.csv("file:///C:/Users/mminich/Desktop/stackoverflow/trainingSummary2.csv", na.strings=c(".", "NA", "", "?"), strip.white=TRUE, encoding="UTF-8")
set.seed(crv$seed) 
crs$nobs <- nrow(crs$dataset) # 6 observations 
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.67*crs$nobs) # 4 observations
crs$validate <- NULL
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 2 observations
# The following variable selections have been noted.
crs$input <- c("time1", "time2", "time3", "time4", "time5", "time6", "time7")
crs$numeric <- c("time1", "time2", "time3", "time4", "time5", "time6", "time7")
crs$categoric <- NULL
crs$target  <- "result"
crs$risk    <- NULL
crs$ident   <- NULL
crs$ignore  <- NULL
crs$weights <- NULL
require(kernlab, quietly=TRUE)
set.seed(crv$seed)
crs$ksvm <- ksvm(as.factor(result) ~ .,
      data=crs$dataset[,c(crs$input, crs$target)],
      kernel="polydot",
      kpar=list("degree"=1),
      prob.model=TRUE)

Unfortunately even with nearly 1000 training datasets, both of the resulting models give me only slightly better results than I would get by just random chance. I'm pretty sure it would do better if there's a way to avoid blasting away either the temporal data or the distinction between different sensors. How can I do that? BTW, I don't know if it's important, but the sensor readings for all sensors are taken at almost exactly the same time, but the time difference between one reading and the next varies by maybe 10 to 20% generally from one run to the next (i.e. between "training" files), and I have no control over that. I think that's probably safe to ignore (i.e. I think it's probably safe to just number the readings sequentially like 1,2,3,etc.).
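For reference, here is a minimal sketch of a third layout that keeps both the sensor identity and the reading order: every (sensor, reading) pair gets its own named column, so one result file becomes one row. The helper `flatten_observation` and the column naming scheme are made up for illustration, and the two data frames are just the example files above typed in by hand with the timestamps dropped (assuming, as noted, that numbering the readings sequentially is acceptable).

```r
# Hypothetical helper: turn one observation (a time x sensor table) into a
# single named row, so "sensor2_t3" always means "sensor 2, 3rd reading".
flatten_observation <- function(obs, sensor_cols) {
  m <- as.matrix(obs[, sensor_cols, drop = FALSE])
  v <- as.vector(m)  # column-major: all of sensor1, then all of sensor2, ...
  names(v) <- paste0(rep(sensor_cols, each = nrow(m)), "_t", seq_len(nrow(m)))
  v
}

# training168_yes.csv and training169_no.csv from above, typed in directly.
yes168 <- data.frame(
  sensor1 = c(0, 51, 0, 300, 0, 0, 335),
  sensor2 = c(0, 60, 0, 0, 0, 0, 0),
  sensor3 = c(0, 0, 0, 0, 1518, 0, 0),
  sensor4 = c(0, 172, 0, 37, 0, 0, 4273)
)
no169 <- data.frame(
  sensor1 = c(0, 51, 633, 0, 202, 0, 0),
  sensor2 = c(0, 0, 0, 0, 655, 0, 934),
  sensor3 = c(0, 498, 204, 0, 739, 0, 0),
  sensor4 = c(0, 0, 55, 512, 656, 0, 7814)
)

sensors  <- paste0("sensor", 1:4)
features <- rbind(flatten_observation(yes168, sensors),
                  flatten_observation(no169, sensors))

# One row per result file: the label plus 4 sensors x 7 readings = 28 inputs.
training <- data.frame(result = factor(c("yes", "no")), features)
dim(training)
```

With the full set of result files stacked this way, the same `ksvm(as.factor(result) ~ ., ...)` call from the Rattle logs should accept the table unchanged; whether a non-linear kernel such as `"rbfdot"` does better than the degree-1 `"polydot"` on it is something I would have to test.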

phonetagger
  • When I first started posting R questions, I was asked to ["make a reproducible example"](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250). These tips helped me ask better R questions and even answer my own R questions. In a nutshell, if you provide sample data in your question along with the code that you think should work (or that is as close as you can get to making it work), and the expected output, then we all have a great start to helping you find your answer. – Christopher Bottoms Apr 20 '15 at 13:34
  • Have you looked into Kalman Filters? They are meant to combine two error prone signals into one more reliable signal. – Benjy Kessler Apr 20 '15 at 14:10
  • @BenjyKessler - I'm not sure if Kalman is applicable in my case? The Wikipedia article about Kalman filters talks about a "fixed lag smoother", but in my case the lag between bursts from one sensor vs the other sensors is unknown, but known to (probably) exist. I'm hoping the SVM model can automatically tune itself to detect the lags, if they're significant. And the lags will change from time to time anyway; I'm hoping every time the process changes I can just re-train the SVM with new training data. Can Kalman still help if I don't know the lags? Also, can it help if I have 6 inputs, not 2? – phonetagger Apr 20 '15 at 15:12
  • @ChristopherBottoms - Thank you for your suggestion. I am working on putting together a small example of what I've already tried, with dismal results. – phonetagger Apr 20 '15 at 15:14
  • @ChristopherBottoms, Hopefully my update is sufficient as a reproducible example? – phonetagger Apr 21 '15 at 19:16
  • They should make it a requirement that if a person wants to downvote a question, they MUST leave a comment as to why they downvoted it (anonymously if they so choose), and if the comment makes no sense, anyone can flag it for reversal. – phonetagger Apr 23 '15 at 19:08
  • I never tried something like that, but there's something that could work. `kernlab` supports `kernelMatrix` input to `ksvm`, so you can calculate a similarity matrix between time series by comparing `sensor_i` with its counterpart in the other series, then aggregate the matrices. That way the model is no longer agnostic regarding the sensors. – catastrophic-failure Jul 22 '16 at 18:45
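A rough sketch of the `kernelMatrix` idea from the last comment, under several assumptions: the toy data is random and purely illustrative, the per-sensor similarity is a Gaussian of the squared Euclidean distance with an arbitrary `sigma`, and the per-sensor matrices are aggregated by summing (a sum of valid kernels is itself a valid kernel). The helper `sensor_kernel` is hypothetical; only `as.kernelMatrix` and `ksvm` come from kernlab, and the fit is skipped if kernlab is not installed.

```r
# Hypothetical toy data: n observations, each a (time x sensor) matrix.
set.seed(1)
n <- 20; n_time <- 7; n_sensors <- 4
obs <- lapply(seq_len(n), function(i) {
  matrix(rnorm(n_time * n_sensors), n_time, n_sensors)
})
labels <- factor(rep(c("no", "yes"), length.out = n))

# Gaussian similarity between sensor s of every pair of observations.
# sigma is an arbitrary illustrative value, not a tuned one.
sensor_kernel <- function(obs, s, sigma = 0.1) {
  x  <- t(sapply(obs, function(m) m[, s]))  # n x n_time, one row per observation
  d2 <- as.matrix(dist(x))^2                # squared Euclidean distances
  exp(-sigma * d2)
}

# Aggregate the per-sensor matrices by summing them.
K <- Reduce(`+`, lapply(seq_len(n_sensors), function(s) sensor_kernel(obs, s)))

# Fit on the precomputed kernel, only if kernlab is available.
if (requireNamespace("kernlab", quietly = TRUE)) {
  model <- kernlab::ksvm(kernlab::as.kernelMatrix(K), labels, type = "C-svc")
}
```

Because each sensor contributes its own similarity term, a burst in sensor 1 is only ever compared with sensor 1 of another run, never with sensor 2.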

1 Answer


An SVM takes feature vectors and uses them to build a classifier. Your feature vectors can have seven dimensions: one for each of the six sources, plus time as the seventh. Each point in time at which you have a signal produces another vector. Create t vectors, Vt, of size 7 each and make those your feature vectors. Populate them with your data and pass them into ksvm. Adding t as another feature both ties together all the data recorded at a specific time and helps the SVM learn that there is a progression of values. You can then choose a subset of Vt as a training set. You will have to manually tag these vectors with a label that is the correct classification.
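A minimal sketch of what building those vectors could look like in R, using the 4-sensor example file from the question (so each vector has 5 inputs rather than 7). The helper name `to_time_vectors` is made up for illustration, and shifting the timestamps so the first reading is at 0 is an assumption about how t should be expressed.

```r
# training168_yes.csv from the question: 4 sensors, 7 readings.
obs <- data.frame(
  seconds = c(454768042.4, 454768042.6, 454768043.3, 454768043.7,
              454768044.0, 454768044.3, 454768044.7),
  sensor1 = c(0, 51, 0, 300, 0, 0, 335),
  sensor2 = c(0, 60, 0, 0, 0, 0, 0),
  sensor3 = c(0, 0, 0, 0, 1518, 0, 0),
  sensor4 = c(0, 172, 0, 37, 0, 0, 4273)
)

# Hypothetical helper: one labelled row per time step, with t measured from
# the first reading of the run.
to_time_vectors <- function(obs, label) {
  data.frame(
    result = factor(label, levels = c("no", "yes")),
    t      = obs$seconds - obs$seconds[1],
    obs[, setdiff(names(obs), "seconds")]
  )
}

vt <- to_time_vectors(obs, "yes")
vt
```

Stacking the output for every run with `rbind` gives the 2-d matrix of feature vectors. Note that with this layout the classifier scores each time step on its own, so a single yes/no for a whole run would still have to be derived from the per-row predictions in some way, for example by averaging them.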

Benjy Kessler
  • I'm not very fluent in either R or statistics, so I don't know how to interpret `S1,...,S6,t produces {Vt}, t=t0->tn Vt<-R^7`. Can you add info in your answer that breaks that down bit by bit & explain what each piece means, and what you mean by "produces", including the meaning of the comma following it and the "->" and "<-" symbols? Thanks very much. – phonetagger Apr 20 '15 at 15:18
  • I rephrased, I am not an expert in R either. This should be your approach. Maybe someone else who knows R can help you with the implementation. – Benjy Kessler Apr 20 '15 at 15:20
  • Have you created ksvm objects manually (not using Rattle) in R? I'm having trouble figuring out how to create these "feature vectors" you speak of. I can load a single "observation" from its .csv file, which produces a table whose typeof() is "list", although when I print it, it appears to be a matrix (2-dimensional array). Do I just create a list of these 2-dimensional objects & simply feed that in as the "data" to the ksvm? BTW, I'm now converting the timestamps from the number of seconds since 2000 to just the number of seconds since the start, so the 1st row's timestamp is 0.000 and so on. – phonetagger Apr 22 '15 at 18:33
  • Umm I've actually never used R before :) IIUC you need to build a 2-d matrix where each row is a feature vector. – Benjy Kessler Apr 22 '15 at 18:35
  • Wow. R must not be very popular on SO. People are crawling out of the woodwork to answer C and C++ questions. – phonetagger Apr 22 '15 at 18:40
  • There's a stack exchange site called cross validated specifically for questions like this. – Benjy Kessler Apr 22 '15 at 18:43
  • I'm sorry, I don't understand... are you saying I should have posted this question on that site? – phonetagger Apr 22 '15 at 19:06
  • Not as such but you might have gotten better answers if you had. – Benjy Kessler Apr 22 '15 at 19:07
  • phonetagger, I view your question as a programming question with a statistical component. And agree with @BenjyKessler that [Cross Validated](http://stats.stackexchange.com/) would be a location as well (the exact wording would need to be tweaked, but the main parts are there). – Richard Erickson Apr 24 '15 at 16:53
  • @RichardErickson - Sorry, I should have mentioned that I posted a rewording of this question at http://stats.stackexchange.com/questions/147816/how-to-create-a-multidimensional-data-structure-in-r-as-input-to-kernlabs-ksvm, but still haven't heard from anyone about it. – phonetagger Apr 25 '15 at 17:03
  • I think you need to do some feature extraction from the time series; there are several ways to do this. You can then feed those features into the SVM. I am not very familiar with this area. You can google and see what methods exist for feature extraction from time series data. – user24318 Mar 25 '18 at 05:09
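As one possible reading of the feature-extraction suggestion, the sketch below summarises each sensor's series with a few burst-oriented numbers (largest reading, index of the largest reading, total, and count of non-zero readings). The helper `extract_features` and the choice of these four summaries are assumptions made for illustration, not an established recipe; the data is the 4-sensor "yes" example from the question.

```r
# Hypothetical helper: per-sensor summary features for one observation.
extract_features <- function(obs) {
  feats <- lapply(names(obs), function(s) {
    x <- obs[[s]]
    c(peak      = max(x),        # size of the largest burst
      peak_time = which.max(x),  # reading index at which it occurred
      total     = sum(x),        # same quantity as the TRIAL #1 column
      bursts    = sum(x > 0))    # number of non-zero readings
  })
  names(feats) <- names(obs)
  unlist(feats)  # names become e.g. "sensor1.peak", "sensor1.peak_time", ...
}

# training168_yes.csv from the question, timestamps dropped.
yes168 <- data.frame(
  sensor1 = c(0, 51, 0, 300, 0, 0, 335),
  sensor2 = c(0, 60, 0, 0, 0, 0, 0),
  sensor3 = c(0, 0, 0, 0, 1518, 0, 0),
  sensor4 = c(0, 172, 0, 37, 0, 0, 4273)
)

f <- extract_features(yes168)
f
```

Each result file then becomes one short row (here 16 numbers) that still records both which sensor fired and roughly when, and the rows can be stacked into a table for `ksvm` like any other data frame.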