Conditionally merging data from two data frames

Question

I am trying to combine information from two data frames of eye-tracking data. One df (behavioral) holds the start and end times of each trial in the experiment. The other df (gaze) holds the timestamp of each recorded gaze sample. For each gaze timestamp, I want to check whether it falls between the start and end times of a trial (taken from the behavioral df) and, if so, copy that trial's label into the Trial column of the gaze df.

The dfs are as follows:

Behavioral df
   StartTime EndTime Trial
1:         0     0.8     a
2:         1     1.8     b
3:         2     2.8     c
4:         3     3.8     d

Gaze df
    Gaze   x   y Frame Trial
 1: 0.00 100 200   126    NA
 2: 0.20 101 201   126    NA
 3: 0.40 102 202   127    NA
 4: 0.80 103 203   127    NA
 5: 0.60 104 204   127    NA
 6: 0.90 105 205   127    NA
 7: 1.20 106 206   128    NA
 8: 1.40 107 207   128    NA
 9: 1.60 108 208   128    NA
10: 2.02 109 209   129    NA
11: 2.50 110 210   129    NA
12: 2.90 111 211   129    NA
13: 3.10 112 212   130    NA
14: 3.79 113 213   130    NA

I want to go through the gaze timestamps one by one. For example, for Gaze$Gaze[1]: is it between 0 and 0.8? Yes, so Gaze$Trial[1] = a.

I have tried

for(i in Gaze$Gaze){
  if(as.numeric(Gaze$Gaze[i]) >= as.numeric(Behavior$StartTime[i])){
    if(as.numeric(Gaze$Gaze[i]) <= as.numeric(Behavior$EndTime[i])){
      Gaze$Trial[i]<-Behavior$Trial[i]
    }
  }
  else Gaze$Trial[i]<-NA

}

I get the error:

Error in if (as.numeric(fakegaze$Gaze[i]) >= as.numeric(fakebehavior$StartTime[i])) { : argument is of length zero

I believe I might need to use another for loop to iterate through the two dfs separately before merging the information, but I'm not sure where to start. Thanks!

Data:

library(data.table)
beh = setDT(structure(list(StartTime = c(0, 1, 2, 3), EndTime = c(0.8, 1.8, 2.8, 3.8
), Trial = c("a", "b", "c", "d")), row.names = c(NA, -4L), class = "data.frame"))

gaze = setDT(structure(list(Gaze = c(0, 0.2, 0.4, 0.8, 0.6, 0.9, 1.2, 1.4, 
1.6, 2.02, 2.5, 2.9, 3.1, 3.79), x = 100:113, y = 200:213, Frame = c(126L, 
126L, 127L, 127L, 127L, 127L, 128L, 128L, 128L, 129L, 129L, 129L, 
130L, 130L), Trial = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA)), row.names = c(NA, -14L), class = "data.frame"))

Frank · Accepted Answer · 2019-07-24T18:09:13.260

You can use a non-equi join to update Trial in the gaze table:

gaze[, Trial := beh[.SD, on=.(StartTime <= Gaze, EndTime >= Gaze), x.Trial]]

    Gaze   x   y Frame Trial
 1: 0.00 100 200   126     a
 2: 0.20 101 201   126     a
 3: 0.40 102 202   127     a
 4: 0.80 103 203   127     a
 5: 0.60 104 204   127     a
 6: 0.90 105 205   127  <NA>
 7: 1.20 106 206   128     b
 8: 1.40 107 207   128     b
 9: 1.60 108 208   128     b
10: 2.02 109 209   129     c
11: 2.50 110 210   129     c
12: 2.90 111 211   129  <NA>
13: 3.10 112 212   130     d
14: 3.79 113 213   130     d

This approach assumes that the intervals in beh do not overlap; if they did, the right Trial for a timestamp could be ambiguous.
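
The no-overlap assumption can be checked before running the join. The following is a minimal sketch, not part of the original answer: has_overlap is a hypothetical helper name, and it treats the intervals as closed on both ends, matching the <= and >= conditions used in the join.

```r
library(data.table)

# Same trial table as in the question
beh <- data.table(StartTime = c(0, 1, 2, 3),
                  EndTime   = c(0.8, 1.8, 2.8, 3.8),
                  Trial     = c("a", "b", "c", "d"))

# Hypothetical helper (an assumption, not a data.table function):
# TRUE if any interval starts at or before the end of an earlier interval.
has_overlap <- function(dt) {
  s <- dt[order(StartTime, EndTime)]
  # running maximum of the end times of all earlier rows
  prev_end <- cummax(s$EndTime)[-nrow(s)]
  any(s$StartTime[-1] <= prev_end)
}

has_overlap(beh)  # FALSE for the question's data
```

If this returns TRUE for real data, a timestamp can match more than one trial and the update is no longer well defined.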

(OP didn't tag the question with data.table or include the library(data.table) call, but I'm assuming they're using it based on how the tables were printed.)

As a workaround for the ".SD is locked" error (a bug), I usually use copy(.SD), as recommended in the error message. However, as the OP pointed out in the comments, this can be expensive with large data. An alternative that is usually equivalent is to flip the join around:

# convert to correct NA type
gaze[, Trial := rep(beh$Trial[NA_integer_], .N)] 
# reversed update join
gaze[beh, on=.(Gaze >= StartTime, Gaze <= EndTime), Trial := i.Trial]

For the OP's case, it still seems to produce the right result. I usually avoid this kind of join because I find it harder to read and it can have strange side effects. In particular, in x[i, on=, v := i.v], if multiple rows of i map to the same row of x, only the last matching row is used (with no warning or error).
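
To make that side effect concrete, here is a small sketch with made-up data (beh2 and g are illustrative tables, not the OP's). The two intervals deliberately overlap, so the timestamp 0.6 matches both trials.

```r
library(data.table)

# Illustrative data (assumption): trials "a" and "b" overlap on [0.5, 0.8]
beh2 <- data.table(StartTime = c(0, 0.5),
                   EndTime   = c(0.8, 1.2),
                   Trial     = c("a", "b"))
g <- data.table(Gaze = c(0.2, 0.6, 1.0), Trial = NA_character_)

# Reversed update join, same form as above
g[beh2, on = .(Gaze >= StartTime, Gaze <= EndTime), Trial := i.Trial]

# Gaze 0.6 lies in both intervals but ends up with a single label;
# per the note above, the later row of beh2 is the one that sticks,
# and no warning is raised.
print(g)
```

Checking beh for overlapping intervals first avoids relying on this silent overwrite.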

This is perfect, thanks! Is there a way to do it for a larger set of values imported from a csv file? Currently I am importing the files with fread, and running the same non-equi statement, but I get the error: Error in set(i, j = lc, value = newval) : .SD is locked. Updating .SD by reference using := or set are reserved for future use. Use := in j directly. Or use copy(.SD) as a (slow) last resort, until shallow() is exported. Do you know the reason for this? Thanks again! — Channing Everidge Hambric, Jul 24 '19 at 18:00
@ChanningEveridgeHambric It is a bug, but I'm not sure why it happens in some cases and not others. I would use `copy(.SD)` in place of `.SD` as recommended in the error message unless it was far too slow, as might happen if the table is very large. I've edited the answer to show another option that works here and a link to the issue. — Frank, Jul 24 '19 at 18:12
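
For reference, the copy(.SD) variant mentioned in the comment above could look like the sketch below, using the question's data. The .SDcols = "Gaze" argument is an addition that is not in the original answer: it is assumed here as a way to limit the copy to the one column the join needs, which may reduce the cost on large tables.

```r
library(data.table)

# Data from the question
beh <- data.table(StartTime = c(0, 1, 2, 3),
                  EndTime   = c(0.8, 1.8, 2.8, 3.8),
                  Trial     = c("a", "b", "c", "d"))
gaze <- data.table(Gaze = c(0, 0.2, 0.4, 0.8, 0.6, 0.9, 1.2, 1.4,
                            1.6, 2.02, 2.5, 2.9, 3.1, 3.79),
                   x = 100:113, y = 200:213)

# copy(.SD) hands the inner join its own unlocked copy of the data.
# .SDcols = "Gaze" (an assumption, see above) restricts .SD to the join column.
gaze[, Trial := beh[copy(.SD), on = .(StartTime <= Gaze, EndTime >= Gaze), x.Trial],
     .SDcols = "Gaze"]

print(gaze)
```

On this data the Trial column should match the output shown in the answer, with NA for the timestamps 0.90 and 2.90 that fall between trials.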
