How to match multiple columns based on lookup table

Question

I have the following two data frames:

lookup <- data.frame(id = c("A", "B", "C"),
                     price = c(1, 2, 3))

results <- data.frame(price_1 = c(2,2,1),
                      price_2 = c(3,1,1))

I now want to go through all columns of results and add the respective matching id from lookup as new columns. So I first want to take the price_1 column and find the ids (here: "B", "B", "A") and add it as a new column to results and then I want to do the same for the price_2 column.

My real-life case would need to match 20+ columns, so I want to avoid a hard-coded manual solution and are looking for a dynamic approach, ideally in the tidyverse.

results <- results %>%
  left_join(., lookup, by = c("price_1" = "id")

would give me the manual solution for the first column and I could repeat this with the second column, but I'm wondering if I can do this automatically for all my results columns.

Expected output:

price_1 price_2 id_1 id_2
2       3       "B"  "C"
2       1       "B"  "A"
1       1       "A"  "A"

GKi · Answer 1 · 2023-04-15T05:36:39.873

You can use lapply and match to match multiple columns based on lookup table.

cbind(results, lapply(results, \(x) lookup$id[match(x, lookup$price)]))
#  price_1 price_2 price_1 price_2
#1       2       3       B       C
#2       2       1       B       A
#3       1       1       A       A

And with the desired column names using setNames and sub.

cbind(results, lapply(setNames(results, sub("price", "id", names(results))),
                      \(x) lookup$id[match(x, lookup$price)]))
#  price_1 price_2 id_1 id_2
#1       2       3    B    C
#2       2       1    B    A
#3       1       1    A    A

Or using unlist and format the result using a marix.

cbind(results, matrix(lookup$id[match(unlist(results, FALSE, FALSE),
                                      lookup$price)], nrow(results)))
#  price_1 price_2 1 2
#1       2       3 B C
#2       2       1 B A
#3       1       1 A A

Or using fastmatch.

library(fastmatch)
cbind(results, lapply(results, \(x) lookup$id[fmatch(x, lookup$price)]))

For other options have a look at fast R lookup table.

Benchmark

set.seed(0)
n <- 1e6
lookup <- data.frame(id = sample(LETTERS), price = sample(length(LETTERS)))
results <- data.frame(price_1 = sample(lookup$price, n, TRUE),
                      price_2 = sample(lookup$price, n, TRUE))

library(fastmatch)
library(dplyr)
bench::mark(check = FALSE,
            "GKi Lapply" = cbind(results, lapply(results, \(x) lookup$id[match(x, lookup$price)])),
"GKi unlist" = cbind(results, matrix(lookup$id[match(unlist(results, FALSE, FALSE),
                                                     lookup$price)], nrow(results))),
"GKi fastmatch" = cbind(results, lapply(results, \(x) lookup$id[fmatch(x, lookup$price)])),
"Ronak Shah Base" = {new_df <- results
names(new_df) <- paste0("id", seq_along(new_df))
new_df[] <- lookup$id[match(unlist(new_df), lookup$price)]
cbind(results, new_df)},
"Ronak Shah dplyr" = {bind_cols(results, results %>%  mutate_all(~lookup$id[match(., lookup$price)]))}
)

Result

  expression            min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 GKi Lapply        14.39ms  20.73ms    31.9      30.6MB    19.9     16    10
2 GKi unlist        34.83ms  41.64ms    21.1      76.3MB    30.7     11    16
3 GKi fastmatch     10.98ms  12.14ms    63.5      22.9MB    21.8     32    11
4 Ronak Shah Base     1.09s    1.09s     0.917     176MB     4.58     1     5
5 Ronak Shah dplyr  48.31ms   55.1ms    17.8      35.6MB    11.8      9     6

In this case using lapply with fastmatch is the fastest of the compared methods.

score 2 · Accepted Answer · answered Dec 03 '19 at 12:59

2

We could unlist the dataframe and match directly.

new_df <- results
names(new_df) <- paste0("id", seq_along(new_df))
new_df[] <- lookup$id[match(unlist(new_df), lookup$price)]
cbind(results, new_df)

#  price_1 price_2 id1 id2
#1       2       3   B   C
#2       2       1   B   A
#3       1       1   A   A

In dplyr, we can do

library(dplyr)
bind_cols(results, results %>%  mutate_all(~lookup$id[match(., lookup$price)]))

answered Dec 03 '19 at 12:59

Ronak Shah

377,200
20
156
213

Nice.The dplyr solution is even ~30 times faster in my case than @GKi's solution. Not that it matters, as we are talking about milliseconds, but might be more relavant for larger datasets with 10k+ rows. – deschen Dec 05 '19 at 09:38

How to match multiple columns based on lookup table

2 Answers2

Linked

Related