3

I have code that pulls and processes data from a database based upon a client name. Some clients may have data that does not include a specific column name, e.g., last_name or first_name. For clients that do not use last_name or first_name, I don't care. For clients that do use either of those fields, I need to mutate() those columns with toupper() so that I can join on those standardized fields later in the ETL process.

Right now, I'm using a series of if() statements and some helper functions to look into the names of a dataframe then mutate if they exist. I'm using if() statements because ifelse() is mostly vectorized and doesn't handle dataframes well.

library(dplyr)
set.seed(256)

b <- data.frame(id = sample(1:100, 5, FALSE), 
                col_name = sample(1000:9999, 5, FALSE), 
                another_col = sample(1000:9999, 5, FALSE))

d <- data.frame(id = sample(1:100, 5, FALSE), 
                col_name = sample(1000:9999, 5, FALSE), 
                last_name = sample(letters, 5, FALSE))

mutate_first_last <- function(df){

  mutate_first_name <- function(df){
    df %>%
      mutate(first_name = first_name %>% toupper())
  }

  mutate_last_name <- function(df){
    df %>%
      mutate(last_name = last_name %>% toupper())
  }


  n <- c("first_name", "last_name") %in% names(df)

  if (n[1] & n[2]) return(df %>% mutate_first_name() %>% mutate_last_name())
  if (n[1] & !n[2]) return(df %>% mutate_first_name())
  if (!n[1] & n[2]) return(df %>% mutate_last_name())
  if (!n[1] & !n[2]) return(df)

}

I get what I expect to get this way

> b %>% mutate_first_last()
  id col_name another_col
1 48     8318        6207
2 39     7155        7170
3 16     4486        4321
4 55     2521        8024
5 15     1412        4875
> d %>% mutate_first_last()
  id col_name last_name
1 64     7438         A
2 43     4551         Q
3 48     7401         K
4 78     3682         Z
5 87     2554         J

but is this the best way to handle this kind of task? To dynamically look to see if a column name exists in a dataframe then mutate it if it does? It seems strange to have to have multiple if() statements in this function. Is there a more streamlined way to process these data?

Shree
  • 10,835
  • 1
  • 14
  • 36
Steven
  • 3,238
  • 21
  • 50

2 Answers2

7

You can use mutate_at with one_of, both from dplyr. This will mutate column only if it matches with one of c("first_name", "last_name"). If no match, it will generate a simple warning but you can ignore or suppress it.

library(dplyr)

d %>%
  mutate_at(vars(one_of(c("first_name", "last_name")), toupper)

  id col_name last_name
1 19     7461         V
2 52     9651         H
3 56     1901         P
4 13     7866         Z
5 25     9527         U

# example with no match
b %>%
  mutate_at(vars(one_of(c("first_name", "last_name"))), toupper)

  id col_name another_col
1 34     9315        8686
2 26     5598        4124
3 17     3318        2182
4 32     1418        4369
5 49     4759        6680
Warning message:
Unknown variables: `first_name`, `last_name`

Here are a bunch of other ?select_helpers in dplyr -

These functions allow you to select variables based on their names.

starts_with(): starts with a prefix

ends_with(): ends with a prefix

contains(): contains a literal string

matches(): matches a regular expression

num_range(): a numerical range like x01, x02, x03.

one_of(): variables in character vector.

everything(): all variables.

Shree
  • 10,835
  • 1
  • 14
  • 36
4

Update dplyr 1.0.0

In dplyr 1.0, the scoped variants of mutate such as _at or _all were replaced by across().

In addition, the best tidy_select helper for this case is any_of as it will perform on the variables which exist, but ignores those that don't exist (without warning message).

As result, you can write the following:

# purrr syntax
d %>% mutate(across(any_of(c("first_name", "last_name")), ~toupper(.x)))

# function name syntax
d %>% mutate(across(any_of(c("first_name", "last_name")), toupper))

which both return the mutated column

  id col_name last_name
1 19     4398         Q
2 72     1135         S
3 54     9767         V
4 60     4364         K
5 35     1564         X

while

b %>% mutate(across(any_of(c("first_name", "last_name")), toupper))

ignores the columns and thus returns (without warning message):

  id col_name another_col
1 42     7601        4482
2 22     1773        7072
3 47     2719        5884
4  1     9595        5945
5 81     8044        3927 
Agile Bean
  • 6,437
  • 1
  • 45
  • 53