R Loop to subset large data frame and give multiple row output

Question

Following up from a question I asked yesterday here, I am trying to design a loop that will subset events from data df1 based on unique combinations of matching date, time, and ID in a second dataset df2. The output for each iteration will be multiple rows long, and each iteration will have either differing numbers of rows, or might be empty. In the end I need to combine all the iteration outputs into 1 data frame that show the dates, times, and ID numbers for each event on each date. Allocating an empty matrix and running a regular FOR loop or nested loop isn't getting me anywhere. I don't know if I need to start with a different type of structure, or if my dimensions are wrong. Maybe there is an easier way all together.

Here is a sample of the data structure (although the original data is much longer).

dput(df1)
structure(list(Date = c("12-31-2008", "12-31-2008", "12-31-2008", 
"12-31-2008", "12-31-2008", "12-31-2008", "01-01-2009", "01-01-2009", 
"01-01-2009", "01-01-2009", "01-10-2009", "01-10-2009", "01-10-2009", 
"01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009", 
"01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009", 
"01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009", 
"01-10-2009", "01-11-2009", "01-11-2009", "01-17-2009", "01-17-2009", 
"01-18-2009", "01-18-2009", "01-18-2009", "01-18-2009", "01-18-2009", 
"01-18-2009", "01-18-2009", "01-18-2009", "01-18-2009", "01-18-2009", 
"01-18-2009", "01-18-2009", "01-19-2009", "01-19-2009", "01-19-2009", 
"01-19-2009", "01-19-2009"), IDNum = c("534198", "534198", "534198", 
"534198", "534198", "534198", "534198", "534198", "534198", "534198", 
"534198", "534198", "534198", "534198", "534198", "534198", "534198", 
"534198", "534198", "534198", "534198", "534198", "534198", "534198", 
"534198", "534198", "534198", "534198", "534198", "534198", "534198", 
"534198", "534198", "534198", "534198", "534198", "534198", "534198", 
"534198", "534198", "534198", "534198", "534198", "534198", "534198", 
"534198", "534198", "534198", "534198", "534198"), Time = c("19:01", 
"19:53", "20:55", "22:03", "23:04", "23:55", "00:45", "01:48", 
"02:50", "03:50", "02:35", "03:42", "04:49", "05:53", "06:55", 
"07:55", "08:43", "10:23", "10:31", "11:41", "15:27", "16:33", 
"17:41", "18:46", "19:46", "20:48", "21:48", "22:48", "23:48", 
"01:49", "02:49", "21:49", "22:49", "12:04", "13:04", "15:05", 
"16:05", "17:05", "18:07", "18:49", "19:49", "20:49", "21:49", 
"22:50", "23:50", "00:50", "01:50", "03:02", "04:22", "05:25"
)), .Names = c("Date", "IDNum", "Time"), row.names = 8643:8692, class = "data.frame")

dput(df2)
structure(list(Date = c("01-04-2009", "01-05-2009", "01-05-2009", 
"01-06-2009", "01-06-2009", "01-07-2009", "01-07-2009", "01-08-2009", 
"01-08-2009", "01-09-2009", "01-09-2009", "01-10-2009", "01-11-2009", 
"01-12-2009", "01-12-2009", "01-13-2009", "01-14-2009", "01-14-2009", 
"01-21-2009", "01-21-2009", "01-22-2009", "01-22-2009", "01-23-2009", 
"01-23-2009", "01-24-2009", "01-24-2009", "01-25-2009", "01-25-2009", 
"01-26-2009", "01-26-2009", "01-27-2009", "01-28-2009", "01-28-2009", 
"01-28-2009", "01-28-2009", "01-29-2009", "01-29-2009", "01-29-2009", 
"01-29-2009", "02-05-2009", "02-05-2009", "02-05-2009", "02-06-2009", 
"02-06-2009", "02-06-2009", "02-07-2009", "02-07-2009", "02-07-2009", 
"02-08-2009", "02-08-2009"), IDNum = c("599091", "599091", "599091", 
"599091", "599091", "599091", "599091", "599091", "599091", "599091", 
"599091", "599091", "599091", "599091", "599091", "599091", "599091", 
"599091", "534198", "534198", "534198", "534198", "534198", "534198", 
"534198", "534198", "534198", "534198", "534198", "534198", "534198", 
"697345", "697345", "534198", "534198", "697345", "697345", "697345", 
"534198", "697345", "697345", "697345", "697345", "697345", "697345", 
"697345", "697345", "697345", "697345", "697345"), Trip = c("GL0229", 
"GL0229", "GL0229", "GL0229", "GL0229", "GL0229", "GL0229", "GL0229", 
"GL0229", "GL0229", "GL0229", "GL0229", "GL0229", "GL0229", "GL0229", 
"GL0229", "GL0229", "GL0229", "GL0230", "GL0230", "GL0230", "GL0230", 
"GL0230", "GL0230", "GL0230", "GL0230", "GL0230", "GL0230", "GL0230", 
"GL0230", "GL0230", "GL0233", "GL0233", "GL0230", "GL0230", "GL0233", 
"GL0233", "GL0233", "GL0230", "GL0234", "GL0234", "GL0234", "GL0234", 
"GL0234", "GL0234", "GL0234", "GL0234", "GL0234", "GL0234", "GL0234"
), Replicate = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 
12L, 13L, 14L, 15L, 16L, 17L, 18L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 10L, 11L, 12L, 13L, 1L, 2L, 14L, 15L, 3L, 4L, 5L, 16L, 
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L), Start = c("12:00", 
"08:35", "15:33", "08:30", "15:51", "10:02", "23:04", "11:17", 
"21:31", "11:16", "20:07", "11:28", "07:37", "08:40", "16:32", 
"09:14", "08:04", "15:15", "07:16", "16:17", "07:10", "16:40", 
"07:00", "16:25", "07:17", "16:50", "07:20", "16:18", "07:20", 
"15:40", "07:10", "09:34", "11:07", "07:55", "16:38", "07:01", 
"08:26", "14:47", "07:18", "07:47", "09:17", "14:58", "07:48", 
"08:59", "14:53", "07:30", "09:12", "13:47", "08:56", "09:53"
), End = c("17:21", "15:08", "22:44", "15:12", "09:06", "19:16", 
"10:28", "20:12", "10:14", "18:48", "10:53", "20:23", "14:07", 
"15:02", "22:27", "18:03", "15:07", "21:19", "16:04", "22:04", 
"16:31", "23:01", "16:15", "22:07", "16:33", "22:37", "16:05", 
"22:17", "15:22", "22:31", "16:05", "16:41", "19:01", "16:20", 
"21:56", "14:31", "19:46", "00:30", "15:10", "14:21", "19:27", 
"23:45", "14:31", "19:20", "23:05", "14:51", "20:15", "00:17", 
"14:31", "18:07")), .Names = c("Date", "IDNum", "Trip", "Replicate", 
"Start", "End"), row.names = 506:555, class = "data.frame")

First I found the dates that match between the 2 data sets and created a new variable records to display information from df2 based on the matching dates. In this example, I'm just using the 2nd matching date:

match_dates <- as.character(intersect(df1$Date, df2$Date))
records <- df2[which(df2$Date == match_dates[2]),]
print(records)

          Date  IDNum   Trip Replicate Start   End
518 01-11-2009 599091 GL0229        13 07:37 14:07

In the original much larger data set, records ends up being something more like this:

records <- df2[which(df2$Date == match_dates[25]),]
print(records)
#           Date  IDNum   Trip Replicate Start   End
# 659 04-02-2009 507646 GL0247        10 09:43 05:19
# 660 04-02-2009 680845 GL0249         4 05:37 11:29
# 661 04-02-2009 680845 GL0249         5 11:59 16:47

The event(s) of interest for each iteration of records is then defined as df1 times between Start and End like this (I did it this way to preserve the unique combinations of date-time-ID-replicate):

event1 <- subset(df1, Date==records[1,"Date"] & IDNum==records[1,"IDNum"] & Time >= records[1,"Start"] & Time <= records[1,"End"])
event2 <- subset(df1, Date==records[2,"Date"] & IDNum==records[2,"IDNum"] & Time >= records[2,"Start"] & Time <= records[2,"End"])
event3 <- subset(df1, Date==records[3,"Date"] & IDNum==records[3,"IDNum"] & Time >= records[3,"Start"] & Time <= records[3,"End"])

The results for each of these events looks like this:

print(event1) #This result is empty
    [1] NewRecNum Date      IDNum     Time      Speed    
    <0 rows> (or 0-length row.names)

print(event2)
            Date  IDNum  Time
80620 04-02-2009 680845 06:35
80621 04-02-2009 680845 07:35
80622 04-02-2009 680845 08:35
80623 04-02-2009 680845 09:35
80624 04-02-2009 680845 10:35

print(event3)
                    Date  IDNum  Time
        80626 04-02-2009 680845 12:35
        80627 04-02-2009 680845 13:35
        80628 04-02-2009 680845 14:35
        80629 04-02-2009 680845 15:35
        80630 04-02-2009 680845 16:35

I'm aiming for a loop that will take every instance of matching dates from match_dates (in this case 147), create 147 corresponding records from df2, and then use the Date, IDNum, Start, and End times in each records to subset df1 and output the df1 events. What I have so far (that isn't working):

records <- matrix(ncol=6, nrow=nrow(df1)) # Create an empty matrix to start
event=NULL
for (i in 1:length(match_dates)) 
    { records[i] <- df2[which(df2$Date == match_dates[i]), ]

    for (j in 1:nrow(records[i]))
    { event[j] <- subset(df1, Date==records[i,"Date"] & IDNum==records[i,"IDNum"] & Time >= records[i,"Start"] & Time <= records[i,"End"])
      }
}
print(event)

Error in 1:nrow(records[i]) : argument of length 0
In addition: Warning message:
In records[i] <- df2[which(df2$Date == match_dates[i]), ] :
  number of items to replace is not a multiple of replacement length
> print(event)
NULL

Thanks in advance for any help! I'm banging my head against the wall on this.

EDIT/UPDATE:

I changed records to

records <- subset(df2, Date %in% df1$Date)

Then wrote a function to subset the matching rows from df1 as

event_func <- function(df,records,i){
  event_int <- subset(df, Date==records[i,"Date"] & IDNum==records[i,"IDNum"] & Time >= records[i,"Start"] & Time <= records[i,"End"])
  return(event_int)
}

This function works, and outputs what I need. But I'm still having trouble with a loop that will take the 686 rows of records, match them to df1, and output a final data frame of all the df1 rows that match. I also tried using lapply Here's what I have (neither of which is working):

# First option using a loop
final <- data.frame()
event_int <- data.frame()

for (i in 1:nrow(records)) {
  event_int[i] <- event_func(df1, records,i)
  final <- rbind(event_int, event_int[i])
}

# Second option using lapply
lapply(records, event_func(df1,records,1:nrow(records)))

Thanks again for any help!

score 1 · Answer 1 · answered Dec 24 '15 at 01:59

There are several problems here.

records[i] is incorrect, if you want to assign to a row you need records[i,]
df2[which(df2$Date == match_dates[i]),] is not guaranteed to be of any particular size and by assigning it in a loop to records[i,] you're making assumptions about it's size. You could assign an intermediate value and put that into records using yet another loop or better yet use the rbind function each iteration of the loop which would remove the need to pre-assign the size of records
Attempting to assign a data.frame (df2) to a matrix (records) without doing any conversion is asking for trouble. records should probably be a data.frame here anyway.

A much simpler way of doing it is to use the match() function via the %in% interface as so

records <- subset(df2,Date %in% df1$Date)

Thanks @NGaffney this was a much nicer way to make records, and gave me all the values I needed. But could you give an example of assigning an intermediate value with a loop to put into records? — marcinus, Dec 24 '15 at 15:36
I updated code in original question above, but am still having trouble with the loop and writing to a final data frame. — marcinus, Dec 26 '15 at 17:30

score 0 · Accepted Answer · edited May 23 '17 at 12:04

Finally got something to work! I ended up changing a bit of the original coding and found a really helpful answer for the loop from another post here.

1) I defined records first by matching ID and Date between df1 and df2

records <- subset(df1, IDNum %in% df2$IDNum)
records <- subset(records, Date %in% df2$Date)

# Records looks like:
head(records,5)
               Date  IDNum  Time    Speed
    8653 01-10-2009 534198 02:35 4.001809
    8654 01-10-2009 534198 03:42 4.117383
    8655 01-10-2009 534198 04:49 4.263277
    8656 01-10-2009 534198 05:53 4.310865
    8657 01-10-2009 534198 06:55 4.353049

# df2 looks like:
head(df2)
          Date  IDNum   Trip Replicate Start   End
506 01-04-2009 599091 GL0229         1 12:00 17:21
507 01-05-2009 599091 GL0229         2 08:35 15:08
508 01-05-2009 599091 GL0229         3 15:33 22:44
509 01-06-2009 599091 GL0229         4 08:30 15:12
510 01-06-2009 599091 GL0229         5 15:51 09:06
511 01-07-2009 599091 GL0229         6 10:02 19:16

2) My function for subsetting records based on matching ID, Date, and Time with df2:

event_func <- function(i,...) {
  event_int <- subset(records, Date==df2[i,"Date"] & IDNum==df2[i,"IDNum"] & Time >= df2[i,"Start"] & Time <= df2[i,"End"])
  output <- event_int
  return(output)
}

# For example, subsetting records based on the first row of df2
event_func(1)
            Date  IDNum  Time    Speed
38613 01-04-2009 599091 12:24 1.611527
38614 01-04-2009 599091 15:58 1.545299
38615 01-04-2009 599091 17:02 1.527205

3) I repeated the event_func over all 686 rows of df2 and put the results into a single data frame using the foreach package.

library(foreach)
final.match <- foreach(i = 1:nrow(df2), .combine=rbind) %do% {
  event_func(i)}

The output for final.match was a single data frame with 4 columns and 1634 rows, which was exactly what I was looking for!

R Loop to subset large data frame and give multiple row output

2 Answers2