
I have read up on vectorization as a solution for speeding up a for-loop. However, the data structure I am creating within a for-loop seems to need to be a data.frame/table.

Here is the scenario:

I have a large table of serial numbers and timestamps. Several timestamps can apply to the same serial number. I only want the latest timestamp for every serial number.

My current approach is to create a vector of unique serial numbers. Then, on each pass through this vector, I create a temporary table that holds all observations of one serial number's serial number/timestamp combinations ('temp'). I then take the last entry of this temporary table (using the tail command) and put it into another table that will eventually hold every unique serial number with its latest timestamp ('last.pass'). Finally, I remove rows from the starting table where the serial number/timestamp combination cannot be found in 'last.pass'.

Here is my code:

#create list of unique serial numbers found in merged 9000 table
hddsn.unique <- unique(merge.data$HDDSN)

#create empty data.table to populate
last.pass <- data.table(HDDSN=as.character(1:length(hddsn.unique)),
   ENDDATE=as.character(1:length(hddsn.unique)))

#populate last.pass with the combination of serial numbers and their latest timestamps
for (i in 1:length(hddsn.unique)) {
  #create temporary table that finds all serial number/timestamp combinations
  temp <- merge.data[merge.data$HDDSN %in% hddsn.unique[i],][,.(HDDSN, ENDDATE)]
  #populate last.pass with the latest timestamp record for every serial number
  last.pass[i,] <- tail(temp, n=1)
}

match <- which(merge.data$HDDSN %in% last.pass$HDDSN &
         merge.data$ENDDATE %in% last.pass$ENDDATE)
final <- merge.data[match]

My ultimate question is: how do I maintain the automated nature of this script while speeding it up, say, through vectorization or by turning it into a function?

Thank you!!!

Pablo Boswell
    You should include sample input data and the desired output for that sample data to make your problem [reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – MrFlick Feb 10 '15 at 23:20
  • From your description this sounds very easy, either with dplyr or data.table or even base functions. But, as MrFlick says, we need an example. Just 5-10 rows of input and the output you want. Use `dput()` to share your data and we'll get all the right classes. – Gregor Thomas Feb 11 '15 at 00:00
  • Also, when you say "large", it's nice to have a scale for that. Do you mean >10 million rows? 1-10 million? An order of magnitude estimation is enough. – Gregor Thomas Feb 11 '15 at 00:02
  • @Martin Morgan has it right. This is what the data looks like. Size is Closer to 1 million rows. I am trying his method and will get back to you. – Pablo Boswell Feb 12 '15 at 21:08

2 Answers


How about this. Without a clear idea of what your input data looks like, I took a guess.

# make some dummy data with multiple visits per serial
merge.data <- data.frame(HDDSN = 1001:1020, 
    timestamps = sample(1:9999, 100))

# create a function to find the final visit for a given serial
fun <- function(serial) {
    this.serial <- subset(merge.data, HDDSN==serial)
    this.serial[which.max(this.serial$timestamps), ]
}

# apply the function to each serial number and clean up the result
final <- as.data.frame(t(sapply(unique(merge.data$HDDSN), fun)))
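Note that `sapply()` over a function returning one-row data frames, followed by `t()`, leaves you with list columns. A variant of the same idea (a sketch on the same kind of dummy data) uses `lapply()` plus `do.call(rbind, ...)`, which keeps ordinary atomic columns:

```r
set.seed(42)  # reproducible dummy data, mirroring the setup above
merge.data <- data.frame(HDDSN = 1001:1020, timestamps = sample(1:9999, 100))

# one-row data frame holding the latest (largest-timestamp) visit for a serial
last.visit <- function(serial) {
  this.serial <- subset(merge.data, HDDSN == serial)
  this.serial[which.max(this.serial$timestamps), ]
}

# stack the per-serial rows into a single data frame with atomic columns
final <- do.call(rbind, lapply(unique(merge.data$HDDSN), last.visit))
```

`final` then has exactly one row per unique serial number, and `final$timestamps` stays numeric rather than becoming a list column.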
J. Win.

This data has several ENDDATE values for each HDDSN:

merge.data <- data.frame(HDDSN = 1001:1100, ENDDATE = sample(9999, 1000))

Place it in order, first by HDDSN and then by ENDDATE:

df = merge.data[do.call("order", merge.data),]

then find the last entry for each HDDSN:

df[!duplicated(df[["HDDSN"]], fromLast=TRUE),]

The following illustrates the key steps:

> head(df, 12)
    HDDSN    ENDDATE
701  1001          4
101  1001        101
1    1001       1225
301  1001       2800
201  1001       6051
501  1001       6714
801  1001       6956
601  1001       7894
401  1001       8234
901  1001       8676
802  1002        247
402  1002        274
> head(df[!duplicated(df[["HDDSN"]], fromLast=TRUE),])
    HDDSN    ENDDATE
901  1001       8676
902  1002       6329
803  1003       9947
204  1004       8825
505  1005       8472
606  1006       9743
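Since the question's merge.data is actually a data.table, the same order-then-take-last idea can be written as a one-liner in data.table syntax (a sketch on dummy data like the above; `.SD[.N]` takes the last row of each group):

```r
library(data.table)

set.seed(123)  # dummy data with several ENDDATE values per HDDSN
merge.data <- data.table(HDDSN = 1001:1100, ENDDATE = sample(9999, 1000))

# order by serial then timestamp, then keep each group's last (latest) row
last.pass <- merge.data[order(HDDSN, ENDDATE), .SD[.N], by = HDDSN]
```

This avoids both the explicit loop and the row-by-row assignment in the original script, and it scales well to ~1 million rows.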

If there are composite keys, then look for duplicates on a data.frame rather than a vector, !duplicated(df[, c("key1", "key2")]), as illustrated in the following:

> df = data.frame(k0=c(1:3, 1:6), k1=1:3)
> df[!duplicated(df, fromLast=TRUE),]
  k0 k1
1  1  1
2  2  2
3  3  3
7  4  1
8  5  2
9  6  3

(The row numbers are from the original data frame, so rows 4-6 were duplicates.) Some care may be needed, especially if one of the columns is numeric, because duplicated.data.frame pastes the columns together into a single string, so rounding error may creep in.

Martin Morgan
  • Martin, how might you adapt the last command to accommodate 2 variables from df. That is, it will work for a composite key scenario. – Pablo Boswell Feb 27 '15 at 21:51
  • @PabloBoswell I tried to update the response with my understanding of your question. – Martin Morgan Feb 27 '15 at 22:35
  • Thank you! Do you know if the unique() function works similarly. Could you just write last.pass[unique(last.pass[, c("HDDSN","PHEADNO")], fromLast=TRUE),] – Pablo Boswell Feb 28 '15 at 02:22