0

I have two dataframes in R which I need to use to find the two best matches (or one if only one possible or 0 if none possible) for each one exposed individual.

One dataframe (called df) is a dataframe with 1,000 observations, of which 34 observations are exposed and 966 are unexposed. It contains information on ID, Exposure, Sex, Birthyear, Colour, and start date. The start date is only available for the 34 exposed observations (that is what defines the exposed observations).

Here is a reproducible example of such a dataframe:

# Set the seed for reproducibility
set.seed(123)

# Create the ID column from 1 to 1000
id <- 1:1000

# Create the exposure column with 966 zeros and 34 ones
exposure <- sample(c(0, 1), size = 1000, replace = TRUE, prob = c(966/1000, 34/1000))

# Create the sex column with 55% Male (M) and 45% Female (F)
sex <- sample(c("Male", "Female"), size = 1000, replace = TRUE, prob = c(0.55, 0.45))

# Create the birthyear column with random numbers from 1961 to 1981
birthyear <- sample(1961:1981, size = 1000, replace = TRUE)

# Create the colour column with specified distribution
colour <- sample(c("green", "red", "blue", "yellow"), size = 1000, replace = TRUE, prob = c(0.35, 0.2, 0.3, 0.15))

# Create the date_start column with specified conditions
date_start <- ifelse(exposure == 0, NA, sample(seq(as.Date("2009-11-23"), as.Date("2020-05-11"), by = "days"), size = sum(exposure), replace = TRUE))

# Create the dataframe
df <- data.frame(ID = id, Exposure = exposure, Sex = sex, Birthyear = birthyear, Colour = colour, Date_Start = as.Date(date_start, origin = "1970-01-01"))

# Print the first few rows of the dataframe
head(df)

The other dataframe (called id_date_combinations) is a dataframe with all the 1,000 observations, with one row per ID, date from 2018-01-01 to 2021-12-31, and measurement "a", "b", and "c".

Here is a reproducible example of such a dataframe:

# Create a vector of all IDs
all_ids <- 1:1000

# Create a vector of all dates ranging from 2018-01-01 to 2021-12-31
all_dates <- seq(as.Date("2018-01-01"), as.Date("2021-12-31"), by = "days")

# Generate all combinations of IDs and dates
id_date_combinations <- expand.grid(ID = all_ids, Date = all_dates)

# Repeat each combination three times for measurements "a", "b", and "c"
id_date_combinations <- id_date_combinations[rep(1:nrow(id_date_combinations), each = 3), ]
id_date_combinations$Measurement <- rep(c("a", "b", "c"), times = nrow(id_date_combinations) / 3)

id_date_combinations$Value <- sample(10000, size = nrow(id_date_combinations), replace = TRUE)

# Print the first few rows of the dataframe
head(id_date_combinations)

id_date_combinations %>% group_by(Measurement) %>% tally()

What now needs to be done is to for each exposed observation in the df dataframe match it to all unexposed individuals with the same Sex, Birthyear, Colour. These observations should then be checked in the id_date_combinations if they have a measurement for "a", "b", and "c" respectively, which should be within 730 days from the Date_Start of the exposed observation in the df dataframe.

When all observations fulfilling this criteria are identified for each exposed observation, the two (or one if only one available or zero if none available) unexposed observations with the closest distance to the exposed observation in the Value column for "a", "b", and "c" should be identified and creating a new dataframe with one column being the ID of the exposed observation and one column being the ID of the unexposed observation (if any) or observations. As such, each exposed ID can have either 0 rows (if no matched observation identified), 1 row (if one matched observation identified9 or 2 rows (if two matched observations identified).

The matching should be done without replacement, as such an unexposed observation can only be selected once.

The matching procedure should be done in chronological order, meaning first should the unexposed observations be matched to the exposed observation with the earliest Date_Start and so on.

Pontus Hedberg
  • 301
  • 1
  • 2
  • 9

0 Answers0