13

I couldn't find a question similar to the one that I have here. I have a very large named list of named vectors that match column names in a dataframe. I would like to use the list of named vectors to replace values in the dataframe columns that match each list element's name. That is, the name of the vector in the list matches the name of the dataframe column and the key-value pair in each vector element will be used to recode the column.

Reprex below:

library(tidyverse)

# Starting tibble
test <- tibble(Names = c("Alice","Bob","Cindy"),
               A = c(3,"q",7),
               B = c(1,2,"b"),
               C = c("a","g",9))

# Named vector
A <- c("5" = "alpha", "7" = "bravo", "3" = "charlie", "q" = "delta")
B <- c("1" = "yes", "2" = "no", "b" = "bad", "c" = "missing")
C <- c("9" = "beta", "8" = "gamma", "a" = "delta", "g" = "epsilon")

# Named list of named vectors
dicts <- list("A" = A, "B" = B, "C" = C) # Same names as columns

I'm able to use mutate and specify the column and list item manually.

# Works when replacement vector is specified
test %>% 
  mutate(across(c("A"), 
                ~recode(., !!!dicts$A)))
#> # A tibble: 3 x 4
#>   Names A       B     C    
#>   <chr> <chr>   <chr> <chr>
#> 1 Alice charlie 1     a    
#> 2 Bob   delta   2     g    
#> 3 Cindy bravo   b     9

However, the following does not work:

# Does not work when replacement vector using column names
test %>% 
  mutate(across(c("A", "B", "C"), 
                ~recode(., !!!dicts$.)))

Error: Problem with mutate() input ..1. x No replacements provided. i Input ..1 is (function (.cols = everything(), .fns = NULL, ..., .names = NULL) ....

Additionally, I've found that map2_dfr works only when all non-recoded columns are specified:

# map2_dfr Sort of works, but requires dropping some columns
map2_dfr(test %>% select(names(dicts)), 
         dicts, 
         ~recode(.x, !!!.y))
#> # A tibble: 3 x 3
#>   A       B     C      
#>   <chr>   <chr> <chr>  
#> 1 charlie yes   delta  
#> 2 delta   no    epsilon
#> 3 bravo   bad   beta

I'm looking to recode columns using the names from the list, without dropping columns.

ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
hvgupta
  • 189
  • 12

9 Answers9

6

You can try the base R code below

idx <- match(names(dicts), names(test))
test[idx] <- Map(`[`, dicts, test[idx])

which gives

> test
# A tibble: 3 x 4
  Names A       B     C
  <chr> <chr>   <chr> <chr>
1 Alice charlie yes   delta
2 Bob   delta   no    epsilon
3 Cindy bravo   bad   beta
AndrewGB
  • 16,126
  • 5
  • 18
  • 49
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
5

Below are three approaches:

First, we can make it work with dplyr::across in a custom function using dplyr::cur_column().

library(tidyverse)

myfun <- function(x) {
  mycol <- cur_column()
  dplyr::recode(x, !!! dicts[[mycol]])
}

test %>% 
  mutate(across(c("A", "B", "C"), myfun))

#> # A tibble: 3 x 4
#>   Names A       B     C      
#>   <chr> <chr>   <chr> <chr>  
#> 1 Alice charlie yes   delta  
#> 2 Bob   delta   no    epsilon
#> 3 Cindy bravo   bad   beta

A second option is to transform the dicts into a list of expression and then just splice it into mutate using the !!! operator:

expr_ls <-  imap(dicts, ~ quo(recode(!!sym(.y), !!!.x)))

test %>% 
  mutate(!!! expr_ls)

#> # A tibble: 3 x 4
#>   Names A       B     C      
#>   <chr> <chr>   <chr> <chr>  
#> 1 Alice charlie yes   delta  
#> 2 Bob   delta   no    epsilon
#> 3 Cindy bravo   bad   beta

Finally, in the larger tidyverse we could use purrr::lmap_at, but it makes the underlying function more complex than it needs to be:

myfun2 <- function(x) {
  x_nm <- names(x)
  mutate(x, !! x_nm := recode(!! sym(x_nm), !!! dicts[[x_nm]]))
}

lmap_at(test, 
        names(dicts),
        myfun2)
#> # A tibble: 3 x 4
#>   Names A       B     C      
#>   <chr> <chr>   <chr> <chr>  
#> 1 Alice charlie yes   delta  
#> 2 Bob   delta   no    epsilon
#> 3 Cindy bravo   bad   beta

Original data

# Starting tibble
test <- tibble(Names = c("Alice","Bob","Cindy"),
               A = c(3,"q",7),
               B = c(1,2,"b"),
               C = c("a","g",9))

# Named vector
A <- c("5" = "alpha", "7" = "bravo", "3" = "charlie", "q" = "delta")
B <- c("1" = "yes", "2" = "no", "b" = "bad", "c" = "missing")
C <- c("9" = "beta", "8" = "gamma", "a" = "delta", "g" = "epsilon")

# Named list of named vectors
dicts <- list("A" = A, "B" = B, "C" = C) # Same names as columns

Created on 2021-12-15 by the reprex package (v2.0.1)

TimTeaFan
  • 17,549
  • 4
  • 18
  • 39
  • The `dplyr::across()` with `dplyr::cur_column()` worked the best. Thank you! – hvgupta Dec 21 '21 at 01:01
  • I will also note that I modified it a bit by naming the columns in a separate line, like so `matching_vars <- na.omit(names(dicts[names(test)]))` and then passing that to the first argument in `across()`. – hvgupta Dec 21 '21 at 01:03
3

Base R (should be translated easily to dplyr)

# Helper function
look_dict <- function(col, values) dicts[[col]][values]

# lapply
test[names(dicts)] <- lapply(names(dicts), \(col) look_dict(col, test[[col]]))

# magrittr and for loop to avoid repeating code
library(magrittr)
for (col in names(dicts)) test[[col]] %<>% look_dict(col, .)

# # A tibble: 3 x 4
#   Names A       B     C      
#   <chr> <chr>   <chr> <chr>  
# 1 Alice charlie yes   delta  
# 2 Bob   delta   no    epsilon
# 3 Cindy bravo   bad   beta   
s_baldur
  • 29,441
  • 4
  • 36
  • 69
2

One work around would be to use your map2_dfr code, but then bind the columns that are needed to the map2_dfr output. Though you still have to drop the names column.

library(tidyverse)

map2_dfr(test %>% select(names(dicts)),
         dicts,
         ~ recode(.x,!!!.y)) %>%
  dplyr::bind_cols(., Names = test$Names) %>%
  dplyr::select(4, 1:3)

Output

# A tibble: 3 × 4
  Names A       B     C      
  <chr> <chr>   <chr> <chr>  
1 Alice charlie yes   delta  
2 Bob   delta   no    epsilon
3 Cindy bravo   bad   beta 
AndrewGB
  • 16,126
  • 5
  • 18
  • 49
  • 1
    Thank you, however my actual data as a large number of columns, so I'm looking for a solution where I don't have to drop columns that aren't being recoded. – hvgupta Dec 13 '21 at 13:36
2

using base R and recode:

for (x in names(dicts)) { test[[x]] <- do.call(recode, c(list(test[[x]]), dicts[[x]])) }

> test
# A tibble: 3 × 4
  Names A       B     C      
  <chr> <chr>   <chr> <chr>  
1 Alice charlie yes   delta  
2 Bob   delta   no    epsilon
3 Cindy bravo   bad   beta   

Also note that other solutions based on Map() or str_replace_all() only work because the test example only uses simple substitutions. If .default or .missing were used they would most probably fail.

Karl Forner
  • 4,175
  • 25
  • 32
2

Not a full answer, but I figured a benchmark of the (at the point of writing) existing solutions might be helpful. As with every benchmark YMMV:

As we see, sindri_baldur's base R version is actually the fastest

(code below)

bench::mark(
  karl_base_r(data, dicts),
  tim_across(data, dicts),
  tim_lmap(data, dicts),
  sotos_pivot(data, dicts),
  thomas_base_r(data, dicts),
  sindri_base_r(data, dicts),
  check = FALSE
)
#> # A tibble: 6 x 6
#>   expression                      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 karl_base_r(data, dicts)    825.9us  968.9us     814.   428.17KB     6.25
#> 2 tim_across(data, dicts)      5.04ms   6.44ms     147.      2.4MB     4.15
#> 3 tim_lmap(data, dicts)        7.34ms   8.49ms     108.   106.06KB     4.17
#> 4 sotos_pivot(data, dicts)    12.79ms  14.58ms      60.6    1.26MB     4.18
#> 5 thomas_base_r(data, dicts)    392us  438.6us    1891.         0B     4.07
#> 6 sindir_base_r(data, dicts)  116.8us  136.7us    5793.         0B     4.11

Larger dataset

For a larger dataset, ThomasIsCoding base R version is a bit faster than Sindir's solution.

set.seed(15)
data_large <- data %>% sample_n(1e6, replace = TRUE)

bench::mark(
  karl_base_r(data_large, dicts),
  tim_across(data_large, dicts),
  tim_lmap(data_large, dicts),
  thomas_base_r(data_large, dicts),
  sindir_base_r(data_large, dicts),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 5 x 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                       <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 karl_base_r(data_large, dicts)      856ms   856ms      1.17   503.9MB     9.35
#> 2 tim_across(data_large, dicts)       647ms   647ms      1.55   504.9MB    10.8 
#> 3 tim_lmap(data_large, dicts)         809ms   809ms      1.24   503.6MB    11.1 
#> 4 thomas_base_r(data_large, dicts)    131ms   148ms      6.53    80.1MB     3.27
#> 5 sindir_base_r(data_large, dicts)    150ms   180ms      5.08    80.1MB     5.08

Code

library(tidyverse)
library(magrittr)

# Starting tibble
data <- tibble(Names = c("Alice","Bob","Cindy"),
               A = c(3,"q",7),
               B = c(1,2,"b"),
               C = c("a","g",9))

# Named vector
A <- c("5" = "alpha", "7" = "bravo", "3" = "charlie", "q" = "delta")
B <- c("1" = "yes", "2" = "no", "b" = "bad", "c" = "missing")
C <- c("9" = "beta", "8" = "gamma", "a" = "delta", "g" = "epsilon")

# Named list of named vectors
dicts <- list("A" = A, "B" = B, "C" = C) # Same names as columns


# function definitions 

karl_base_r <- function(data, dicts) {
  for (x in names(dicts)) 
    {data[[x]] <- do.call(recode, c(list(data[[x]]), dicts[[x]])) }
  
  data
}

tim_across <- function(data, dicts) {
  
  myfun <- function(x) {
    mycol <- cur_column()
    dplyr::recode(x, !!! dicts[[mycol]])
  }
  
  data %>% 
    mutate(across(c("A", "B", "C"), myfun))
}

tim_lmap <- function(data, dicts) {
  myfun2 <- function(x) {
    x_nm <- names(x)
    mutate(x, !! x_nm := recode(!! sym(x_nm), !!! dicts[[x_nm]]))
  }
  
  lmap_at(data, 
          names(dicts),
          myfun2)
}

sotos_pivot <- function(data, dicts) {
  data %>% 
    pivot_longer(-1) %>% 
    left_join(stack(dicts) %>% 
                rownames_to_column('value'),
              by = c('value',  'name' = 'ind')) %>% 
    pivot_wider(id_cols = -value, names_from = name, values_from = values)
}

thomas_base_r <- function(data, dicts) {
  idx <- match(names(dicts), names(data))
  data[idx] <- Map(`[`, dicts, data[idx])
  data
}

sindri_base_r <- function(data, dicts) {
  look_dict <- function(col, values) dicts[[col]][values]
  
  data[names(dicts)] <- lapply(names(dicts), \(col) look_dict(col, data[[col]]))
  data
}

Created on 2021-12-15 by the reprex package (v2.0.0)

David
  • 9,216
  • 4
  • 45
  • 78
  • 1
    Thanks for the benchmark. I'm suprised that `lmap` is much faster and closer to `across` than I expected. – TimTeaFan Dec 15 '21 at 11:48
  • TBH, I haven't used `lmap()` yet, but it seems useful for many usecases. – David Dec 15 '21 at 11:58
  • It's definitely useful when working with lists. When it was introduced, one likely use case were `data.frame` operations (also mentioned in the documentation), but I think with `dplyr::across` this use case became redundant, since `across` is much more powerful. – TimTeaFan Dec 15 '21 at 12:16
1

A solution with merging the two can be,

library(dplyr)
library(tidyr)

test %>% 
 pivot_longer(-1) %>% 
 left_join(stack(dicts) %>% 
             rownames_to_column('value'),
           by = c('value',  'name' = 'ind')) %>% 
 pivot_wider(id_cols = -value, names_from = name, values_from = values)

# A tibble: 3 x 4
#  Names A       B     C      
#  <chr> <chr>   <chr> <chr>  
#1 Alice charlie yes   delta  
#2 Bob   delta   no    epsilon
#3 Cindy bravo   bad   beta   
Sotos
  • 51,121
  • 6
  • 32
  • 66
1

Another option using purrr without having to go into complex tidyeval things.

library(purrr)
library(tibble)

test %>% 
  lmap_at(c("A", "B", "C"), 
          ~ as_tibble_col(dicts[[names(.x)]][unlist(.x)], names(.x)))

# # A tibble: 3 x 4
#   Names A       B     C      
#   <chr> <chr>   <chr> <chr>  
# 1 Alice charlie yes   delta  
# 2 Bob   delta   no    epsilon
# 3 Cindy bravo   bad   beta

This would be very easy if there was a modify2_at() or imodify_at() function or something, but here we are using lmap_at() as a work around.

0

EDITED

Here is a pipe friendly solution using qdap::mgsub. I am afraid neither stringr::str_replace_all nor stringi::stri_replace_first_fixed() seems to be working. Please see the comments for more information.

test %>% 
  mutate(across(
    c("A", "B", "C"),
    ~qdap::mgsub( names(dicts[[cur_column()]]), dicts[[cur_column()]], .x)
    ))
Jakub.Novotny
  • 2,912
  • 2
  • 6
  • 21