I have read up on vectorization as a solution for speeding up a for-loop. However, the data structure I am creating within a for-loop seems to need to be a data.frame/table.
Here is the scenario:
I have a large table of serial numbers and timestamps. Several timestamps can apply to the same serial number. I only want the latest timestamp for every serial number.
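To illustrate, here is a made-up miniature of the data and the result I am after. The column names match my table, but the values are invented, and storing the timestamps as character strings is an assumption for this example only.

```r
library(data.table)

# Hypothetical toy data: column names as in my real table, values made up
merge.data <- data.table(
  HDDSN   = c("A1", "A1", "B2", "B2", "B2", "C3"),
  ENDDATE = c("2015-01-01 08:00:00", "2015-01-03 09:30:00",
              "2015-01-02 10:00:00", "2015-01-04 11:15:00",
              "2015-01-05 12:45:00", "2015-01-02 07:20:00")
)

# Desired result: one row per serial number, holding its latest timestamp
wanted <- data.table(
  HDDSN   = c("A1", "B2", "C3"),
  ENDDATE = c("2015-01-03 09:30:00", "2015-01-05 12:45:00",
              "2015-01-02 07:20:00")
)
```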
My current approach is to create a vector of unique serial numbers. Then, on each iteration over this vector, I create a temporary table ('temp') that holds all serial number/timestamp combinations for that serial number. I take the last entry of this temporary table (using tail()) and put it into another table ('last.pass') that will eventually hold every unique serial number and its latest timestamp. Finally, I remove the rows from the starting table whose serial number/timestamp combination cannot be found in 'last.pass'.
Here is my code:
#create list of unique serial numbers found in merged 9000 table
hddsn.unique <- unique(merge.data$HDDSN)
#create empty data.table to populate
last.pass <- data.table(HDDSN = as.character(seq_along(hddsn.unique)),
                        ENDDATE = as.character(seq_along(hddsn.unique)))
#populate last.pass with the combination of serial numbers and their latest timestamps
for (i in seq_along(hddsn.unique)) {
  #create temporary table that finds all serial number/timestamp combinations
  temp <- merge.data[merge.data$HDDSN %in% hddsn.unique[i], ][, .(HDDSN, ENDDATE)]
  #populate last.pass with the latest timestamp record for every serial number
  last.pass[i, ] <- tail(temp, n = 1)
}
match <- which(merge.data[, (merge.data$HDDSN %in% last.pass$HDDSN) &
                            (merge.data$ENDDATE %in% last.pass$ENDDATE)] == TRUE)
final <- merge.data[match]
My ultimate question is: how do I keep this script fully automated while speeding it up, for example through vectorization or by turning it into a function?
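The kind of loop-free version I have in mind might look like the sketch below. It is only a sketch, run on made-up data rather than my real table. It swaps the loop for data.table's grouped `.SD[.N]` idiom (last row per group), and it assumes the rows are already in chronological order within each serial number, which is the same assumption my use of tail() makes.

```r
library(data.table)

# Hypothetical toy data; rows assumed sorted by time within each serial number
merge.data <- data.table(
  HDDSN   = c("A1", "A1", "B2", "B2", "B2", "C3"),
  ENDDATE = c("2015-01-01 08:00:00", "2015-01-03 09:30:00",
              "2015-01-02 10:00:00", "2015-01-04 11:15:00",
              "2015-01-05 12:45:00", "2015-01-02 07:20:00")
)

# .SD[.N] keeps the last row of each HDDSN group, like tail(temp, n = 1),
# so no explicit loop or pre-allocated last.pass table is needed
final <- merge.data[, .SD[.N], by = HDDSN]
```

If the rows were not already ordered, sorting first (for example with `setorder(merge.data, HDDSN, ENDDATE)`) would presumably be needed before taking the last row per group.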
Thank you!!!