0

Problem: I have 2 datasets with no matching identifiers (like ID) and need to find the closest match in df1$time to df2$tstart. df1 (with time column) has 660,000 rows with time stamps approximately every 0.00125 s. Whatever the closest match is to df2$tstart, I would like a new column made (df1$trial_start) that says "yes", otherwise "no".

I've tried findInterval, but it only seems to match in ascending order, and doesn't check values in both directions. In the below code, it looks good for most of the outputs, but there are some indices where the value after the returned index is closer to $tstart

#my actual code: 
index_closest <- findInterval(iti_summaries_2183[["24"]]$tstart, poke_1s$time)
poke_1s$trial_start <- ifelse(seq_len(nrow(poke_1s)) %in% index_closest, "yes", "no")

I've also tried which.min, which doesn't work since the lists lengths don't match. Additionally, I've fought with roll = "nearest" like here but the functions return values and I'm not sure how to create a new column and assign y/n.

Code to replicate problem:

n <- 773
df1 <- structure(list(initiate = sample(c(0,1), replace=TRUE, size=n), 
                      left = sample(c(0,1), replace=TRUE, size=n), 
                      right = sample(c(0,1), replace=TRUE, size=n), 
                      time = seq(from = 2267.2, to = 2363.75, by = 0.125)))

df1 <- data.frame(df1)
                
df2 <- structure(list(trial = c(156:162), 
                      control = c(0, 0, 0, 0, 3, 0, 3), 
                      t_start = c(2267.231583, 2289.036355, 2298.046849, 2318.933635, 2328.334036, 2347.870449, 2363.748095), 
                      t_end = c(2268.76760, 2290.83370, 2299.38547, 2320.71400, 2329.93985, 2349.15464, 2365.12455)), 
                 class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -7L), spec = structure(list(
    cols = list(trial = structure(list(), class = c("collector_double", 
    "collector")), control = structure(list(), class = c("collector_double", 
    "collector")), t_start = structure(list(), class = c("collector_double", 
    "collector")), t_end = structure(list(), class = c("collector_double", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1L), class = "col_spec"))
Amanda
  • 49
  • 6

1 Answers1

0

If I understand your question correctly :

library(data.table)

setDT(df1)
setDT(df2)

df1[df2,.(initiate,left,right,x.time,trial,control,t_start,t_end,
          trial_start=fifelse(x.time>t_start&x.time<t_end,'Y','N')),
    on=.(time=t_start),roll='nearest']

   initiate  left right   x.time trial control  t_start    t_end trial_start
      <num> <num> <num>    <num> <int>   <num>    <num>    <num>      <char>
1:        0     0     1 2267.200   156       0 2267.232 2268.768           N
2:        0     0     1 2289.075   157       0 2289.036 2290.834           Y
3:        0     0     1 2298.075   158       0 2298.047 2299.385           Y
4:        1     1     1 2318.950   159       0 2318.934 2320.714           Y
5:        1     1     1 2328.325   160       3 2328.334 2329.940           N
6:        0     0     1 2347.825   161       0 2347.870 2349.155           N
7:        1     1     0 2363.700   162       3 2363.748 2365.125           N
Waldi
  • 39,242
  • 6
  • 30
  • 78
  • A couple things here. It looks like time in df1 is listed in this new output, but they also look very close to t_start, so I imagine they're closest to t_start and actually should be labeled Y. I'd also like the output of trial_start (y/n) to be assigned to df2. Here I lose all of the time data except those 7 trials. – Amanda Sep 28 '22 at 18:45