0

I'm looking for an easy way of joining a dataset in R to a lookup dataset. The lookup dataset has min, max and a label as below.

library(dplyr)

min_age <- c(0, .5, 1, 5, 17, 35, 50, 75)
max_age <- c(.5, 1, 5, 17, 35, 50, 75, 125)
age_lbl <- c("< 6 months", "6 months - 1 year", "1 year - 5 years", 
             "5 years - 17 years", "17 years - 35 years", "35 years - 50 years"
             , "50 years - 75 years", "> 75 years")
age_lbl <- as.factor(age_lbl)
lkp_df <- data.frame(min_age = min_age, max_age = max_age, age_grp_lb = age_lbl)

The dataframe would look like below

lkp_df
min_age max_age          age_grp_lb 
0.0     0.5              < 6 months
0.5     1.0              6 months - 1 year    
1.0     5.0              1 year - 5 years
5.0     17.0             5 years - 17 years
17.0    35.0             17 years - 35 years
35.0    50.0             35 years - 50 years
50.0    75.0             50 years - 75 years
75.0   125.0             > 75 years

I would now want to join it with a dataframe that has patient ages as below. The join should be based on min_age and max_age and obtain the age_grp_lbl values in the patient dataframe.

head(pat_df)   
pat_id   pat_age 
1001     14.397769
1002     66.694280
1003     53.628013
1004     58.782156
1005      5.032531
1006     16.430463

I have a roundabout way of dealing with this.

# Introduce dummy column to obtain a cartesian
lkp_df <- lkp_df %>%
  mutate(join = 1)

set.seed(6789)
pat_id <- seq(1001, 1075)
pat_age <- runif(75, 0, 95)

pat_df <- data.frame(pat_id, pat_age) %>%
  mutate(pat_age_yrs = as.integer(pat_age),
         pat_age_mths = as.integer(pat_age * 12)) %>%
  # Introduce dummy column to obtain a cartesian
  mutate(join = 1) %>%
  # Create a cartesian product with the join column
  inner_join(lkp_df) %>%
  # Filter to keep only required records
  filter(pat_age >= min_age & pat_age < max_age) %>%
  # Keep only necessary columns
  select(pat_id, pat_age, pat_age_yrs, pat_age_mths, age_grp_lb)

Can someone suggest a better way of dealing with similar situation. Thanks in advance. Masoud A R.

Masoud
  • 1
  • 1
  • If the intervals are contiguous, like here, you may also consider `cut`. – Henrik Jul 16 '17 at 10:35
  • @Jaap, thanks for your links. The fuzzyjoin and the data.table approaches were great. It never came up in my search because of the keywords I was using. – Masoud Jul 17 '17 at 04:42

0 Answers0