5

I'm aware of similar questions here and here, but I haven't been able to figure out the right solution for my specific situation. Some of what I'm finding are solutions which use mutate_, etc but I understand these are now obsolete. I'm new to dynamic usages of dplyr.

I have a dataframe which includes some variables with two different prefixes, alpha and beta:

df <- data.frame(alpha.num = c(1, 3, 5, 7),
             alpha.char = c("a", "c", "e", "g"),
             beta.num = c(2, 4, 6, 8),
             beta.char = c("b", "d", "f", "h"),
             which.to.use = c("alpha", "alpha", "beta", "beta"))

I want to create new variables with the prefix "chosen." which are copies of either the "alpha" or "beta" columns depending on which is named for that row in the "which.to.use" column. The desired output would be:

desired.df <- data.frame(alpha.num = c(1, 3, 5, 7),
                     alpha.char = c("a", "c", "e", "g"),
                     beta.num = c(2, 4, 6, 8),
                     beta.char = c("b", "d", "f", "h"),
                     which.to.use = c("alpha", "alpha", "beta", "beta"),
                     chosen.num = c(1, 3, 6, 8),
                     chosen.char = c("a", "c", "f", "h"))

My failed attempt:

varnames <- c("num", "char")
df %<>%
  mutate(as.name(paste0("chosen.", varnames)) := case_when(
    which.to.use == "alpha" ~ paste0("alpha.", varnames),
    which.to.use == "beta" ~ pasteo("beta.", varnames)
  ))

I'd prefer a pure dplyr solution, and even better would be one which could be included in a longer pipe modifying the df (i.e. no need to stop to create "varnames"). Thanks for your help.

lost
  • 1,483
  • 1
  • 11
  • 19

4 Answers4

6

Using some fun rlang stuff & purrr:

library(rlang)
library(purrr)
library(dplyr)

df <- data.frame(alpha.num = c(1, 3, 5, 7),
                 alpha.char = c("a", "c", "e", "g"),
                 beta.num = c(2, 4, 6, 8),
                 beta.char = c("b", "d", "f", "h"),
                 which.to.use = c("alpha", "alpha", "beta", "beta"),
                 stringsAsFactors = F)

c("num", "char") %>% 
    map(~ mutate(df, !!sym(paste0("chosen.", .x)) := 
      case_when(
          which.to.use == "alpha" ~ !!sym(paste0("alpha.", .x)),
          which.to.use == "beta" ~ !!sym(paste0("beta.", .x))
                ))) %>% 
    reduce(full_join)

Result:

  alpha.num alpha.char beta.num beta.char which.to.use chosen.num chosen.char
1         1          a        2         b        alpha          1           a
2         3          c        4         d        alpha          3           c
3         5          e        6         f         beta          6           f
4         7          g        8         h         beta          8           h

Without reduce(full_join):

c("num", "char") %>% 
  map_dfc(~ mutate(df, !!sym(paste0("chosen.", .x)) := 
                 case_when(
                   which.to.use == "alpha" ~ !!sym(paste0("alpha.", .x)),
                   which.to.use == "beta" ~ !!sym(paste0("beta.", .x))
                 ))) %>% 
  select(-ends_with("1"))



alpha.num alpha.char beta.num beta.char which.to.use chosen.num chosen.char
1         1          a        2         b        alpha          1           a
2         3          c        4         d        alpha          3           c
3         5          e        6         f         beta          6           f
4         7          g        8         h         beta          8           h

Explanation:
(Note: I do not fully or even kind of get rlang. Maybe others can give a better explanation ;).)

Using paste0 by itself produces a string, when we need a bare name for mutate to know it is referring to a variable name.

If we wrap paste0 in sym, it evaluates to a bare name:

> x <- varrnames[1]
> sym(paste0("alpha.", x))
  alpha.num

But mutate does not know to evaluate and instead read it as a symbol:

> typeof(sym(paste0("alpha.", x)))
[1] "symbol"

The "bang bang" !! operator evaluates the sym function. Compare:

> expr(mutate(df, var = sym(paste0("alpha.", x))))
mutate(df, var = sym(paste0("alpha.", x)))

> expr(mutate(df, var = !!sym(paste0("alpha.", x))))
mutate(df, var = alpha.num)

So with !!sym we can use paste to dynamically called variable names with dplyr.

paqmo
  • 3,649
  • 1
  • 11
  • 21
  • 1
    Thanks. My actual data are very large (both tall and wide), and the full_join brings things to a halt... I'm wondering if there's a way to do it using mutate_at or mutate_if... – lost Aug 02 '18 at 06:39
  • 1
    check out my `nest()/map()` answer for a faster `tidyverse` option – Nate Aug 02 '18 at 11:42
  • 1
    @Nate nice. Using `map_dfc` lets you get rid of the `full_join`. – paqmo Aug 02 '18 at 14:19
4

This is a nest()/map() strategy that should be pretty fast. It stays in the tidyverse, but doesn't go into rlang land.

library(tidyverse)

df %>% 
    nest(-which.to.use) %>%
    mutate(new_data = map2(data, which.to.use,
                       ~ select(..1, matches(..2)) %>%
                           rename_all(funs(gsub(".*\\.", "choosen.", .) )))) %>%
    unnest()

  which.to.use alpha.num alpha.char beta.num beta.char choosen.num choosen.char
1        alpha         1          a        2         b           1            a
2        alpha         3          c        4         d           3            c
3         beta         5          e        6         f           6            f
4         beta         7          g        8         h           8            h

It grabs all columns, not just num and char, that are not which.to.use. But that seems like what you (I) would want IRL. You could add a select(matches('(var1|var2|etc')) line before you call nest() if you wanted to pull only specific variables.

EDIT: My original suggestion of using select() to drop unneeded columns would result in doing a join to bring them back later. If instead you adjust the nest parameters, you can acheive this on only certain columns.

I added new bool columns here, but they will be ignored for the "choosen" selection:

new_df <- data.frame(alpha.num = c(1, 3, 5, 7),
                 alpha.char = c("a", "c", "e", "g"),
                 alpha.bool = FALSE,
                 beta.num = c(2, 4, 6, 8),
                 beta.char = c("b", "d", "f", "h"),
                 beta.bool = TRUE,
                 which.to.use = c("alpha", "alpha", "beta", "beta"),
                 stringsAsFactors = FALSE)

new_df %>% 
    nest(matches("num|char")) %>% # only columns that match this pattern get nested, allows you to save others
    mutate(new_data = map2(data, which.to.use,
                           ~ select(..1, matches(..2)) %>%
                               rename_all(funs(gsub(".*\\.", "choosen.", .) )))) %>%
    unnest()

  alpha.bool beta.bool which.to.use alpha.num alpha.char beta.num beta.char choosen.num choosen.char
1      FALSE      TRUE        alpha         1          a        2         b           1            a
2      FALSE      TRUE        alpha         3          c        4         d           3            c
3      FALSE      TRUE         beta         5          e        6         f           6            f
4      FALSE      TRUE         beta         7          g        8         h           8            h
Nate
  • 10,361
  • 3
  • 33
  • 40
  • I only want to do this on specific variables. Adding a select statement would reduce the df, so I'd have to do a join later to bring back the other variables (if I understand correctly). There must be a better way, right? – lost Aug 05 '18 at 07:53
  • 1
    I made an edit to show a better way with just adjusting the `nest()` parameters, let me know if that's what you had in mind – Nate Aug 05 '18 at 11:34
  • 1
    Awesome! Works great. I don't quite understand everything going on here, but that's homework for me. Thanks. – lost Aug 06 '18 at 02:29
2

A base R approach using apply with margin = 1 where we select columns for each row based on the value in which.to.use column and get the value from corresponding column for the row.

df[c("chosen.num", "chosen.char")] <- 
          t(apply(df, 1, function(x) x[grepl(x["which.to.use"], names(df))]))

df
#  alpha.num alpha.char beta.num beta.char which.to.use chosen.num chosen.char
#1         1          a        2         b        alpha          1           a
#2         3          c        4         d        alpha          3           c
#3         5          e        6         f         beta          6           f
#4         7          g        8         h         beta          8           h
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
0

You can also try a gather/spread approach

df %>% 
  rownames_to_column() %>% 
  gather(k,v,-which.to.use,-rowname) %>% 
  separate(k,into = c("k1", "k2"), sep="[.]") %>% 
  filter(which.to.use == k1) %>% 
  mutate(k1="chosen") %>% 
  unite(k, k1, k2,sep=".") %>% 
  spread(k,v) %>%
  select(.,chosen.num, chosen.char) %>% 
  bind_cols(df, .)
    alpha.num alpha.char beta.num beta.char which.to.use chosen.num chosen.char
 1         1          a        2         b        alpha          1           a
 2         3          c        4         d        alpha          3           c
 3         5          e        6         f         beta          6           f
 4         7          g        8         h         beta          8           h
Roman
  • 17,008
  • 3
  • 36
  • 49