5

I was looking for a smart, or "tidier" way, to make use of a lookup table in the tidyverse, but could not find a satisfying solution.

I have a dataset and lookup table:

# Sample data
data <- data.frame(patients = 1:5,
                   treatment = letters[1:5],
                   hospital = c("yyy", "yyy", "zzz", "www", "uuu"),
                   response = rnorm(5))

# Lookup table
lookup <- tibble(hospital = c("yyy", "uuu"), patients = c(1,5))

... where each row in the lookup table is the exact pattern for which I want to filter the first tibble (data).

The wanted result would look like this:

# A tibble: 3 x 4
  patients treatment hospital response
     <dbl> <chr>     <chr>       <dbl>
1     1.00 a         yyy       -0.275 
2     5.00 e         uuu       -0.0967

The easiest solution I came up with is something like this:

as.tibble(dat) %>% 
  filter(paste(hospital, patients) %in% paste(lookup$hospital, lookup$patients))

However, this must be something that a lot of people regularly do - is there a cleaner and more convienent way to do this (i.e. for more than two columns in your lookup table)?

Marco
  • 71
  • 1
  • 1
  • 6

1 Answers1

5

Since the default behavior of dplyr::inner_join() is to match on common columns between the two tibbles passed to the function and the lookup table consists of only the 2 key columns, the shortest code is as follows:

library(dplyr)

# Sample data
data <- tibble(patients = 1:5,
                   treatment = letters[1:5],
                   hospital = c("yyy", "yyy", "zzz", "www", "uuu"),
                   response = rnorm(5))

# Lookup table
lookup <- tibble(hospital = c("yyy", "uuu"), patients = c(1,5))

data %>% inner_join(.,lookup)

...and the output:

> data %>% inner_join(.,lookup)
Joining, by = c("patients", "hospital")
# A tibble: 2 x 4
  patients treatment hospital response
     <dbl> <chr>     <chr>       <dbl>
1     1.00 a         yyy        -1.44 
2     5.00 e         uuu        -0.313
>

Because the desired output can be accomplished by a join on key columns across the tibbles, the paste() code in the OP is unnecessary.

Also note that inner_join() is the right type of join because the desired output is rows that match across both incoming tibbles, and the lookup table does not have duplicate rows. If the lookup table contained duplicate rows, then semi_join() would be the appropriate function, per the comments on the OP.

Len Greski
  • 10,505
  • 2
  • 22
  • 33