I have this set of sequences with 2 variables for a 3rd variable(device). Now i want to break the sequence for each device into sets of 300. dsl
is a data frame that contains d
being the device id and s
being the number of sequences of length 300.
First, I am labelling (column Sid
) all the sequences rep(1,300)
followed by rep(2,300)
and so on till rep(s,300)
. Whatever remains unlabelled i.e. with initialized labels(=0) needs to be ignored. The actual labelling happens with seqid
vector though.
I had to do this as I want to stack the sets of 300 data points and then transpose it. This would form one row of my predata data.frame. For each predata data frame i am doing a k-means to generate 5 clusters that I am storing in final data.
Essentially for every device I will have 5 clusters that I can then pull by referencing the row number in final data (mapped to device id).
#subset processed data by device
for (ds in 1:387){
d <- dsl[ds,1]
s <- dsl[ds,3]
temp.data <- subset(data,data$Device==d)
temp.data$Sid <- 0
temp.data[1:(s*300),4] <- rep(1:300,s)
temp.data <- subset(temp.data,temp.data$Sid!="0")
seqid <- NA
for (j in 1:s){ seqid[(300*(j-1)+1):(300*j)] <- j }
temp.data$Sid <- seqid
predata <- as.data.frame(matrix(numeric(0),s,600))
for(k in 1:s){
temp.data2 <- subset(temp.data[,c(1,2)], temp.data$Sid==k)
predata[k,] <- t(stack(temp.data2)[,1])
}
ob <- kmeans(predata,5,iter.max=10,algorithm="Hartigan-Wong")
finaldata <- rbind(finaldata,(unique(fitted(ob,method="centers"))))
}
Being a noob to R, I ended up with 3 nested loops (the function did work for the outermost loop being one value). This has taken 5h and running. Need a faster way to go about this.
Any help will be appreciated.
Thanks