0

I have been trying to write a for loop in order to determine whether my time data falls within specific time ranges. I have gone through all related questions on stack overflow and so far this is where I have gotten:

Basically, I have one data frame with acoustic measures of vowels. For each vowel, I also have the time in seconds at which the participants uttered the vowel.

Then I have a second dataframe including time intervals. Those intervals correspond to time periods where the participant was talking and there was no overlapping noise. Those intervals therefore identify the vowels from my first dataframe that can be used in subsequent analyses because their acoustic measures are not contaminated by other noises

I need to create a new column ("target") in data frame 1 that indicates, for each participant and for each recording, whether YES or NO the vowel falls into one of the intervals from data frame 2.

these are the variables of interest in data frame 1:

    Participant RecordingNumber    time
1        FSO110               1  37.258
2        FSO110               1  37.432
3        FSO110               1  37.496
4        FSO110               1  38.138
5        FSO110               1  38.499
6        FSO110               1  42.124
7        FSO110               1  61.733
8        FSO110               1  61.924
9        FSO110               1  61.980
10       FSO110               1  62.260
11       FSO110               1  62.610
12       FSO110               1  62.943
13       FSO110               1 194.929
14       FSO110               1 195.403
15       FSO110               1 401.114
16       FSO110               1 401.341

these are the variables of interest in data frame 2:

Participant RecordingNumber    tmin    tmax 
FSO110       1                 445.695 447.250   
FSO110       1                 448.444 449.093   
FSO110       1                 452.990 453.292   
FSO110       1                 481.177 481.709   
FSO110       2                 41.202  41.511   
FSO110       2                 42.176  43.132   
FSO110       2                 44.640  47.710   
FSO110       2                 53.819  56.253   
FSO110       2                 113.453 114.803   
FSO110       2                 123.135 123.374

So far, I have gotten there:

# split dataframes by Participant and Recording Number
data1 <- split(data1, paste0(data1$Participant, data1$RecordingNumber))
data2 <- split(data2, paste0(data2$Participant, data2$RecordingNumber))

# loop through each element of each splitted df 
for (n in seq_along(data1)){
  for (m in seq_along(data2)){
    if(n == m){
    data_split[[n]][["target"]] = as.character(lapply(data1[[n]][["time"]], FUN = function(x){
      for (i in 1:nrow(data2[[m]])){
          if(data2[[m]][["tmin"]]<=x & x<= data2[[m]][["tmax"]]){
            return(paste0("in"))}
        else{
          return(paste0("overlap"))}
          }
      }
    ))}
}

The function seems to work. However, it only works for i == 1 (rows of data2). Therefore, it correctly identifies time points from data 1 that fall into the first interval of each splitted element of data 2 but does not continue for other intervals.

Solutions I have tried:

  1. use ifelse instead of if statement
for (n in seq_along(data1)){
  for (m in seq_along(data2)){
    if (n == m){
      data1[[n]][["target"]] = as.character(lapply(data1[[n]][["time"]], FUN = function(x){
        for (i in 1:nrow(data2[[m]])){
          ifelse((data2[[m]][["tmin"]]<=x & x<= data2[[m]][["tmax"]]), "in", "overlap")
        }
      }
      ))}}
}

However, this function returns NULL for each row of my new "target column".

  1. adding any() to my if statement:
for (n in seq_along(data_split)){
  for (m in seq_along(data_split_target)){
    if(n == m) {
    data_split[[n]][["target"]] = as.character(lapply(data_split[[n]][["time"]], FUN = function(x){
      for (i in 1:nrow(data_split_target[[m]])){
          if(any(data_split_target[[m]][["tmin"]])<=x & any(x<= data_split_target[[m]][["tmax"]])){
            return(paste0("in"))}
        else{
          return(paste0("overlap"))}
          }
      }
    ))}
}

Again, the function seems to work as it correctly creates a new "target" column with "in" and "overlap" rows but the function erroenously returns "in" row values even when the time point did not fall into one of the intervals.

Can someone help me? Many thanks!

  • could you add a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). Using `dput` or a `data.frame` command? Adding a MRE and an example of the desired output (in code form, not tables and pictures) makes it much easier for others to find and test an answer to your question. That way you can help others to help you! – dario Sep 28 '21 at 09:24
  • Does this answer your question? [Join tables by date range](https://stackoverflow.com/questions/23958975/join-tables-by-date-range) – dario Sep 28 '21 at 09:25
  • Thanks Dario! I have tried the sqldf package on Participant 1 Recording 1 ``` result = sqldf("select * from data1 left join data2 on data1.time between data2.tmin and data2.tmax") ``` It works perfectly! Now I am wondering whether I can add a grouping variable into this function? And thank you for these other suggestions, this was my first time asking a question on here! – paulinemaes Sep 28 '21 at 09:54
  • Found the solution Dario! Thank you very for guiding me into the sqldf() direction! – paulinemaes Sep 28 '21 at 10:06
  • 1
    You are welcome ! ;) Also, I think you [should/could accept your own answer](https://stackoverflow.com/help/self-answer) (thou only after 48h) That way your question is of use for others searchig for the same problem.. – dario Sep 28 '21 at 10:21
  • Please trim your code to make it easier to find your problem. Follow these guidelines to create a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). – Community Sep 30 '21 at 01:16

2 Answers2

2

Solved it using sqldf package.

result_all = sqldf("select * from data1
                left join data2
                on data1.rec = data2.rec
                and data1.time between data2.tmin and data2.tmax")

Where $rec is a grouping variable identifying both Participant and Recording Number.

1

Here is a base R way using split/Map.
The data sets are split and then Map applies function f to each sub-df.

meas_split <- split(measures, list(measures$Participant, measures$RecordingNumber))
int_split <- split(intervals, list(intervals$Participant, intervals$RecordingNumber))

nms <- intersect(names(meas_split), names(int_split))
i <- match(names(meas_split[nms]), names(int_split[nms]))
j <- match(names(int_split[nms]), names(meas_split[nms]))

f <- function(X, Y){
  yes <- sapply(X[["time"]], \(x){
    x > Y[["tmin"]] & x < Y[["tmax"]]
  })
  as.integer(colSums(yes) > 0)
}

measures$target <- unlist(Map(f, meas_split[i], int_split[j]))
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66