2

My dataframe in R studio is as follows:

StudyID FITDate.1 ScopeDate.1 ScopeDate.2 ScopeDate.3 ScopeDate.4
1 2014-05-15 2010-06-02 2014-05-28 2014-08-01 2015-10-27
2 2017-11-29 2018-02-27     
3 2015-10-04 2016-06-24 2017-01-18

I have a variable "FITDate.1" indicates the date for FIT test, and several variables "ScopeDate.x" indicates the dates for multiple scope tests.

In my research, a person can have only one date for FIT test, but can have multiple dates for scope. Clinically, if a person has a FIT test, then he will be referred to undertake scope test. However, this person may receive scope tests for other reasons.

So if the date of a scope test is right after the date of a FIT test, then we will define them highly related.

I want to create a variable "FITrelatedscopedate" to include the dates of FIT related scopes. For example, in the row of StudyID==1, the date of "FITDate.1"is 2014-05-15, which is right between ScopeDate.1 (2010-06-02) and ScopeDate.2 (2014-05-28). So the date value 2014-05-28 of ScopeDate.2 is what i need, and I will use 2014-05-28 as the FIT related scope date and write it in the new variable "FITrelatedscopedate".

I think I have to use loop syntax, but i had no experience to realize it. Do you have any experience to solve similar problem? Do you know any codes to realize it? Thanks, any help are appreciated.

George Savva
  • 4,152
  • 1
  • 7
  • 21
Pathfinder
  • 21
  • 3
  • Is there a threshold for how close the Scope date needs to be to the fit date? What would you want if there were more that one scope date that was close to the fit date? – George Savva Apr 08 '22 at 07:36
  • @George Savva Hi George, thanks for your comments. Currently, there is no report on this threshold. If there are more than one scope date, then I will use the one that are later than FIT date and most closest to FIT date. Thanks. – Pathfinder Apr 08 '22 at 15:01
  • @George Savva For example, the first row, the date of "FITDate.1"is 2014-05-15, therefore ScopeDate.2 (2014-05-28), ScopeDate.3(2014-08-01), ScopeDate.4 (2015-10-27) are all after the FITDate.1. But I will use the ScopeDate.2 (2014-05-28), which is the most closest date to FITDate.1. Thanks. – Pathfinder Apr 08 '22 at 15:08
  • @Pathfinder What happens if all dates precede FIT testing - then should the result be `NA` (missing)? This might be easier to manage if the data.frame is first converted to long form (using something like `pivot_longer` from `tidyr`). If you still need assistance on this, let me know. – Ben Apr 10 '22 at 22:41
  • @Ben, you are correct. if all scope dates precede a FIT test date, then the result will be NA (missing). My original data frames are two separate long-format data frames, one only including FIT date, and the other one only including scope dates. I converted them to wide and linked them together because I thought wide format might be easier to realize my calculation. I have to look for the corresponding scope observation to each FIT observation and then calculate the time duration between them. – Pathfinder Apr 11 '22 at 04:03

1 Answers1

0

Here is one approach with tidyverse assuming you start with two long data.frames, one for FIT testing, and the other for endoscopy.

df_fit <- data.frame(
  StudyID = 1:3,
  FITDate = as.Date(c("2014-05-15", "2017-11-29", "2015-10-04"))
)
df_fit

  StudyID    FITDate
1       1 2014-05-15
2       2 2017-11-29
3       3 2015-10-04

df_scope <- data.frame(
  StudyID = c(1,1,1,1,2,3,3),
  ScopeDate = as.Date(c("2010-06-02", "2014-05-28", "2014-08-01", "2015-10-27", "2018-02-27", 
                        "2016-06-24", "2017-01-18"))
)
df_scope

  StudyID  ScopeDate
1       1 2010-06-02
2       1 2014-05-28
3       1 2014-08-01
4       1 2015-10-27
5       2 2018-02-27
6       3 2016-06-24
7       3 2017-01-18

First, you can do a left_join by the StudyID to add the scope dates to the FIT data. Then, you can filter to only keep scope dates after FIT testing. For each StudyID, use slice to retain only the first row (this assumes dates are in chronological order...if not, add arrange(ScopeDate) first in the pipe - let me know if you need help with this).

Then, you can right_join back to df_fit so that those FIT testing dates without endoscopy will have NA for the ScopeDate. The final statement with mutate will calculate the time duration between endoscopy and FIT testing.

library(tidyverse)

left_join(
  df_fit,
  df_scope,
  by = "StudyID"
) %>%
  filter(ScopeDate > FITDate) %>%
  group_by(StudyID) %>%
  slice(1) %>%
  right_join(df_fit) %>%
  mutate(Duration = ScopeDate - FITDate)

Output

  StudyID FITDate    ScopeDate  Duration
    <dbl> <date>     <date>     <drtn>  
1       1 2014-05-15 2014-05-28  13 days
2       2 2017-11-29 2018-02-27  90 days
3       3 2015-10-04 2016-06-24 264 days

Let me know if this works for you. A data.table approach can be considered if you need something faster and have a very large dataset.

If you need the Duration as a numeric column, you can use as.numeric(ScopeDate - FITDate).

Ben
  • 28,684
  • 5
  • 23
  • 45
  • after reading your logic and codes, I think your codes should work. Can I ask you a further question? In my posted sample, a studyID has only one FIT date, but in reality, a single studyID may have multiple FIT dates also. For example, |Study ID| FITDate.1| FITDate.2 | FITDate.3| FITDate.4 | FITDate.5 | ScopeDate.1 | ScopeDate.2 |ScopeDate.3 | 4 | 2014-11-28 | 2016-11-16 | 2017-05-31 | 2018-05-23 | 2019-03-18 | 2014-12-12 | 2017-02-24 | 2018-06-28. Should I repeat your codes for different columns of FIT dates (FITDate.1, FITDate.2...)? Or do you have better solution? – Pathfinder Apr 12 '22 at 20:42
  • in studyID =4, both FITDate.4 (2017-05-31) and FITDate.5 (2018-05-23) are between ScopeDate.2 (2017-02-24) and ScopeDate.3 (2018-06-28). These two FIT test correspond to the ScopeDate.3 of 2018-06-28, but I will only use FITDate.4 to calculate the FIT to endoscopy time duration: ScopeDate.3-FITDate.4. Because, clinically, doctors may prescribe another FITs for a person with FIT (+) result, but in my research, i am only interested in the time duration from first FIT (+) to scope. – Pathfinder Apr 12 '22 at 21:13
  • If you need more information, please let me know. I will update my progress of applying your codes. – Pathfinder Apr 12 '22 at 21:14
  • @Pathfinder This is more complicated to explain in the comments. I would recommend editing your original question, and include example data that includes multiple FIT dates and multiple scope dates. I would recommend you provide what your original data looks like - if it is two long data.frames, include that. If it is one wide data.frame, then include that... – Ben Apr 12 '22 at 23:08
  • In addition, make sure your data includes different scenarios/examples. Some with multiple FIT dates and scopes. Finally, include what your final desired data.frame should look like in the end. It should have the time duration for each `StudyID`. The end result is also important. Please see [this post](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on making a great reproducible example. If I can see your original data, and what your final outcome looks like, I can help you with the code. – Ben Apr 12 '22 at 23:10