0

I have a dataframe where I would like to fill in values from column "reference column" into the column "trial_number" based on the following two conditions:

  • condition 1: value column "start_time" >= value column "a" and
  • condition 2: value column "end_time" <= value column "b"

The values should only be filled into column "trial_number" if both conditions are met at the same time.

One complication is that the columns "start_time" and "end_time" contain more values than columns "a", "b" and "reference_column". The idea is that each of the values in "reference_column" is filled in multiple rows of column "trial_number".

I as well have the columns "a", "b" and "reference_column" in a separate dataframe.

A simplified version of the dataframe looks like this:

 reference_column  a  b trial_number start_time end_time
1                 1 10 20           NA         12       18
2                 2 22 24           NA         22       23
3                 3 30 40           NA         23       24
4                 4 45 55           NA         31       32
5                NA NA NA           NA         35       36
6                NA NA NA           NA         36       37
7                NA NA NA           NA         37       39
8                NA NA NA           NA         46       51
9                NA NA NA           NA         47       53
10               NA NA NA           NA         50       54

In the end, the desired output (filled values in "trial_number" should look like this:

 reference_column  a  b trial_number start_time end_time
1                 1 10 20           01         12       18
2                 2 22 24           02         22       23
3                 3 30 40           02         23       24
4                 4 45 55           03         31       32
5                NA NA NA           03         35       36
6                NA NA NA           03         36       37
7                NA NA NA           03         37       39
8                NA NA NA           04         46       51
9                NA NA NA           04         47       53
10               NA NA NA           04         50       54

Since I am quite new to R, I would be very happy about any recommendations. I already tried to use a for loop in combination with an if-statement, but wasn't successful because of the unbalanced number of data entries and various conditions. Thanks a lot in advance!

sa-al
  • 3
  • 3
  • Hi sa-al, welcome to SO! Some thoughts. *I as well have the columns "a", "b" and "reference_column" in a separate dataframe.* Taking this info into consideration would change any helping approach significantly. Either include the other data frames or maybe generate a small [minimal but still complete reproducible example](https://stackoverflow.com/a/5963610/9462095). Any help you receive based on the current data will either be incomplete or unintentionally misleading. – Andre Wildberg Aug 16 '23 at 21:06
  • Hi Andre, Thanks a lot for your answer! What I wanted to express with this is that I have the exact same columns with same values stored in a separate dataframe. They contain the same information as the columns in the dataframe I have included here as a very simplified example. Just thought the separate dataframe might help if the problem cannot be solved with all information in one dataframe. – sa-al Aug 17 '23 at 07:53

2 Answers2

1

Perhaps this approach would be helpful?

Dataset:

library(tidyverse)

df <- data.frame(referenceColumn = seq(from =1, to = 10, by =1), 
                 a = c(10, 22, 30, 45, NA, NA, NA, NA, NA, NA), 
                 b = c(20, 24, 40, 55, NA, NA, NA, NA, NA, NA), 
                 trialNumber = rep(NA, 10), 
                 startTime = c(12, 22, 23, 31, 35, 36, 37, 46, 47, 50), 
                 endTime = c(18, 23, 24, 32, 36, 37, 39, 51, 53, 54))

Separate your reference table and dataset



dimDf <- df %>% 
  select(referenceColumn:b) 
              

factDf <- df %>% 
  select(trialNumber:endTime)

Use your reference table in a case_when()

dfNew <- factDf %>% 
  mutate(trialNumber = case_when(
    startTime >= dimDf$a[1] & endTime <= dimDf$b[1] ~ dimDf$referenceColumn[1], 
    startTime >= dimDf$a[2] & endTime <= dimDf$b[2] ~ dimDf$referenceColumn[2], 
    startTime >= dimDf$a[3] & endTime <= dimDf$b[3] ~ dimDf$referenceColumn[3], 
    startTime >= dimDf$a[4] & endTime <= dimDf$b[4] ~ dimDf$referenceColumn[4],
    TRUE ~ NA_real_
  ))

example

or keep all columns

dfNew <- df %>% 
  mutate(trialNumber = case_when(
    startTime >= dimDf$a[1] & endTime <= dimDf$b[1] ~ dimDf$referenceColumn[1], 
    startTime >= dimDf$a[2] & endTime <= dimDf$b[2] ~ dimDf$referenceColumn[2], 
    startTime >= dimDf$a[3] & endTime <= dimDf$b[3] ~ dimDf$referenceColumn[3], 
    startTime >= dimDf$a[4] & endTime <= dimDf$b[4] ~ dimDf$referenceColumn[4],
    TRUE ~ NA_real_
  ))

example 2

Susan Switzer
  • 1,531
  • 8
  • 34
  • Okay, I see the logic now. Really is two separate tables, and that was confusing. This is the right approach. – ScottyJ Aug 17 '23 at 14:12
  • Hi Susan, Thanks a lot for the suggested approach. My original reference dataframe has 324 rows, is there any way to perform this "startTime >= dimDf$a[1] & endTime <= dimDf$b[1] ~ dimDf$referenceColumn[1]" in an automated way for all 324 rows? And I tried your code and unfortunately received the following error: – sa-al Aug 17 '23 at 15:45
  • Error in `mutate()`: ℹ In argument: `trial_number = case_when(...)`. Caused by error in `case_when()`: ! `..2 (right)` must be a vector, not `NULL`. Backtrace: 1. fix_updated_n %>% ... 9. dplyr::case_when(...) – sa-al Aug 17 '23 at 15:45
  • Do you have an idea why this error is occurring? – sa-al Aug 17 '23 at 15:46
  • Hi! I do not have a variable named trial_number (snake case with underscore) in my example. Mine is named trialNumber (camelCase). Looks like just a little naming diff. – Susan Switzer Aug 17 '23 at 17:32
  • I'd imagine there is a way to scale up to 300+ rows in your reference table with a function, but that is beyond my skills. Perhaps when you update your variable names and get my example to work you could repost another inquiry about scaling up using my example? – Susan Switzer Aug 17 '23 at 17:47
  • 1
    Hi Susan, I could now fix the error, was simply a spelling mistake :) Only the scaling up still does not work / gives me a hard time. But might be a good idea to post another question with you example about that. Thanks a lot! – sa-al Aug 18 '23 at 16:42
0

My answer is going to be incomplete because I don't fully understand what you are doing with this statement:

One complication is that the columns "start_time" and "end_time" contain more values than columns "a", "b" and "reference_column". The idea is that each of the values in "reference_column" is filled in multiple rows of column "trial_number".

But I think I have some directional help:

library(tidyverse)

df <- tribble(~reference_column,  ~a,  ~b, ~trial_number, ~start_time, ~end_time,
               1, 10, 20,           NA,         12,       18,
               2, 22, 24,           NA,         22,       23,
               3, 30, 40,           NA,         23,       24,
               4, 45, 55,           NA,         31,       32,
              NA, NA, NA,           NA,         35,       36,
              NA, NA, NA,           NA,         36,       37,
              NA, NA, NA,           NA,         37,       39,
              NA, NA, NA,           NA,         46,       51,
              NA, NA, NA,           NA,         47,       53,
              NA, NA, NA,           NA,         50,       54)


df %>% 
  mutate(trial_number = if_else(start_time >= a & end_time <= b, 
                                reference_column,
                                trial_number))

# A tibble: 10 × 6
   reference_column     a     b trial_number start_time end_time
              <dbl> <dbl> <dbl>        <dbl>      <dbl>    <dbl>
 1                1    10    20            1         12       18
 2                2    22    24            2         22       23
 3                3    30    40           NA         23       24
 4                4    45    55           NA         31       32
 5               NA    NA    NA           NA         35       36
 6               NA    NA    NA           NA         36       37
 7               NA    NA    NA           NA         37       39
 8               NA    NA    NA           NA         46       51
 9               NA    NA    NA           NA         47       53
10               NA    NA    NA           NA         50       54

In your example output, you find some way to put values for "trial_number" even where df$a and df$b are == NA. Please either explain the additional logic in your question, or hopefully this is enough to give you some ideas how to proceed.

ScottyJ
  • 945
  • 11
  • 16
  • Hi wackojacko1997, Thanks a lot for your answer and suggested approach. I tried it on my data and it unfortunately doesn't produce the intended result since the column "trial_number" is still empty after running the code. It does not lead to any errors though. – sa-al Aug 17 '23 at 08:04
  • To explain what I meant about the unbalanced number of values: I thought this could be an issue that the values to be filled into "trial_number" are not always the same index of the dataframe than the values to be filled in from column "reference_column". Same with the columns / values to be compared in "a", "b", "start_time", and "end_time". – sa-al Aug 17 '23 at 08:04
  • To give an example: row 7 from "trial_number" needs to be filled with the value from row 3 from "reference_column" (=3) since "start_time" in row 7 >= "a" in row 3 and "end_time" in row 7 is <= "b" in row 3. Hope this makes sense and sorry for the complicated explanations. Could as well be that this is not an issue - as already mentioned, I am still in the progress of learning! – sa-al Aug 17 '23 at 08:04
  • @sa-al, I still don't follow the logic you're applying. Why are you comparing row 3 with row 7? Are these related as `row_number() %% 4` ? Why don't you compare row 3 with row 5, for example? Potentially if this is a modulo relationship, you could create `new_column = row_number() %% 4`, then `group_by(new_column)` and use `fill()` to carry forward values from row 3 to row 7... I'm just not sure what the logic really is that you're using, but you could look at those to see if it helps you. – ScottyJ Aug 17 '23 at 13:14