0

I have a data set of sequences of 2 variables, grouped by a 3rd variable (device). Now I want to break each device's sequence into sets of 300. dsl is a data frame where column d is the device id and column s is the number of complete sequences of length 300 for that device.

First, I label all the sequences (column Sid): rep(1,300) followed by rep(2,300) and so on up to rep(s,300). Whatever remains with its initialized label (= 0) needs to be ignored. The actual labelling happens with the seqid vector, though.
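For illustration, the labelling pattern I describe can be produced in one call to rep (a toy sketch with 3 sequences of length 4 instead of s sequences of length 300):

```r
# toy sizes: 3 sequences of length 4 (in the real data: s sequences of length 300)
s   <- 3
len <- 4
seqid <- rep(seq_len(s), each = len)  # 1 repeated len times, then 2, then 3
seqid
# [1] 1 1 1 1 2 2 2 2 3 3 3 3
```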

I had to do this because I want to stack each set of 300 data points and then transpose it; this forms one row of my predata data frame. For each predata data frame I run k-means to generate 5 clusters, which I store in finaldata.
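To make the stacking step concrete, here is a toy version of what one predata row looks like (two made-up columns x and y, of length 3 instead of 300, standing in for my two variables):

```r
# toy stand-in for one set of points belonging to a single Sid
temp.data2 <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))

# stack() concatenates the columns into one long vector; t() turns it into a row
one.row <- t(stack(temp.data2)[, 1])
one.row
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    2    3    4    5    6
```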

Essentially, for every device I will have 5 clusters that I can then pull by referencing the row number in finaldata (mapped to device id).

#subset processed data by device

for (ds in 1:387){
  d <- dsl[ds,1]   # device id
  s <- dsl[ds,3]   # number of complete length-300 sequences for this device

  temp.data <- subset(data, data$Device==d)
  temp.data$Sid <- 0                        # initialize labels
  temp.data[1:(s*300),4] <- rep(1:300,s)    # mark rows that belong to a complete set
  temp.data <- subset(temp.data, temp.data$Sid!=0)  # drop the unlabelled remainder

  # label each block of 300 rows with its sequence number 1..s
  seqid <- NA
  for (j in 1:s){ seqid[(300*(j-1)+1):(300*j)] <- j }

  temp.data$Sid <- seqid

  predata <- as.data.frame(matrix(numeric(0),s,600))

  # one row per sequence: the 300 values of variable 1 followed by the 300 of variable 2
  for(k in 1:s){
    temp.data2 <- subset(temp.data[,c(1,2)], temp.data$Sid==k)
    predata[k,] <- t(stack(temp.data2)[,1])
  }

  ob <- kmeans(predata, 5, iter.max=10, algorithm="Hartigan-Wong")
  finaldata <- rbind(finaldata, unique(fitted(ob, method="centers")))
}

Being a noob at R, I ended up with 3 nested loops (the code did work when the outermost loop ran for a single value). It has now been running for 5 hours. I need a faster way to go about this.

Any help will be appreciated.

Thanks

  • Can you [provide a sample of your dataset](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? `head(data)`? It's kinda difficult to parse what's going on where. – bnjmn Nov 17 '13 at 20:14

1 Answer

0

OK, I am going to suggest a radical simplification of your code within the loop. However, it is hard to verify that I assumed the right thing without sample data, so please check that my predata in fact equals yours.

First the code:

for (ds in 1:387){
  d <- dsl[ds,1]
  s <- dsl[ds,3]

  temp.data <- subset(data, data$Device==d)
  temp.data <- temp.data[1:(s*300),]

  # reshape directly: each matrix() call lays one variable out row-wise,
  # 300 values per row; cbind glues the two 300-column blocks into s x 600
  predata <- cbind(matrix(temp.data[,1], byrow=TRUE, ncol=300),
                   matrix(temp.data[,2], byrow=TRUE, ncol=300))

  ob <- kmeans(predata, 5, iter.max=10, algorithm="Hartigan-Wong")
  finaldata <- rbind(finaldata, unique(fitted(ob, method="centers")))
}

What I understand you are doing: take the first 300*s elements from your subset(data, data$Device == d). This can easily be done using the command

temp.data <- temp.data[1:(s*300),]

Afterwards, you build a matrix whose first row is c(temp.data[1:300, 1], temp.data[1:300, 2]), and so on for all further rows. I do this using the matrix command as above.
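A quick sanity check on a toy example (2 sequences of length 3 instead of s sequences of length 300) shows the layout this produces:

```r
# two variables a and b, 2 sequences of 3 points each
td <- data.frame(a = 1:6, b = 7:12)

predata <- cbind(matrix(td[, 1], byrow = TRUE, ncol = 3),
                 matrix(td[, 2], byrow = TRUE, ncol = 3))
predata
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    2    3    7    8    9
# [2,]    4    5    6   10   11   12
```

Each row is one sequence: the values of the first variable followed by the values of the second, exactly like the stack/transpose construction.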

I assume that your outer loop could be transformed into a call to tapply or something similar, but for that we would need more context.
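To sketch what that could look like (untested against your real data: I am assuming data holds the two measurement variables in columns 1 and 2 plus a Device column, that dsl's third column is s, and I pull the centers via $centers rather than unique(fitted(...)), which should give the same 5 rows per device):

```r
set.seed(1)
# toy stand-ins for the real objects: 2 devices, 12 sequences of 300 points each
data <- data.frame(v1 = rnorm(7200), v2 = rnorm(7200),
                   Device = rep(1:2, each = 3600))
dsl  <- data.frame(d = 1:2, n = 3600, s = 12)

# one s x 600 matrix and one k-means fit per device
centers_by_device <- lapply(seq_len(nrow(dsl)), function(ds) {
  d <- dsl[ds, 1]
  s <- dsl[ds, 3]
  temp.data <- data[data$Device == d, ][1:(s * 300), ]
  predata <- cbind(matrix(temp.data[, 1], byrow = TRUE, ncol = 300),
                   matrix(temp.data[, 2], byrow = TRUE, ncol = 300))
  kmeans(predata, 5, iter.max = 10, algorithm = "Hartigan-Wong")$centers
})

# assemble once at the end instead of growing finaldata with rbind in the loop
finaldata <- do.call(rbind, centers_by_device)
dim(finaldata)   # 5 centers per device -> 10 rows of 600 columns
```

The single do.call(rbind, ...) at the end avoids the quadratic cost of repeatedly rbind-ing inside the loop.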

Thilo
  • I understand there are multiple loops involved. I think the main stumbling block is finding some way to use a built-in R function like `tapply` which would break the data down, process the pieces, and assemble the results in a separate variable. If there is a particular way to go about this, then it can be applied to the problem. Is that possible? – user2977721 Nov 18 '13 at 15:48
  • @user2977721 Have you tried my version above? Depending on the size of `s`, we might already have saved quite a lot of time. To simplify your outer loop, we definitely need more context and probably a sample dataset. However, I fear that `kmeans` also might be part of the bottleneck. – Thilo Nov 18 '13 at 16:26
  • One more remark: Try profiling your loop using `Rprof`. If most time is lost in the `subset` and `rbind` commands, some version of `apply` might speed things up. If the bottleneck is `kmeans`, you're facing a totally different challenge. Generating `predata` as above should not require too much time. – Thilo Nov 18 '13 at 16:30
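A minimal sketch of the `Rprof` workflow mentioned in the last comment (the `replicate(...)` line is a stand-in workload; you would run your actual loop between the two `Rprof` calls):

```r
# start sampling profiler, writing to a temporary file
Rprof(tmp <- tempfile())

# stand-in workload: repeated small k-means fits (replace with your loop)
invisible(replicate(20, kmeans(matrix(rnorm(2e5), ncol = 100), 5)))

Rprof(NULL)                   # stop profiling
prof <- summaryRprof(tmp)
head(prof$by.self)            # functions ranked by self-time spent
unlink(tmp)
```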