I am writing a series of functions that use dplyr internally to manipulate data.

There are a number of places where I'd like to add new variables to the data set as I work with it. However, I am not sure how to name these new variables so as to avoid overwriting variables already in the data, given that I don't know what's in the data set being passed.

In base R I can do this:

df <- data.frame(a = 1:5)

df[, ncol(df)+1] <- 6:10

and it will select a name for the newly-added variable that doesn't conflict with any existing names. I'd like to do this in dplyr rather than breaking up the consistent application of dplyr to go back to base-R.

All the solutions I've thought of so far feel very kludgy, or require the use of a bunch of base-R futzing anyway that isn't any better than just adding the variable in base-R:

  1. Rename all the variables so I know what the names are
  2. Pull out the names() vector and use one of many methods to generate a name that isn't in the vector
  3. Error out if the user happens to have my internal variable names in their data (bad-practice Olympics!)
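
For reference, option 2 might look roughly like the sketch below. It uses base R's make.unique(); the helper name unique_name and its base argument are made up for illustration and are not part of dplyr or any other package.

```r
# Hypothetical helper: returns a name derived from `base` that does not
# clash with any name in `existing`.
unique_name <- function(existing, base = "tmp") {
  # make.unique() leaves the first occurrence of each name alone and only
  # alters later duplicates, so appending `base` last means the existing
  # names are never changed; the final element is the safe new name.
  tail(make.unique(c(existing, base)), 1)
}

df <- data.frame(a = 1:5, tmp = 6:10)
new_name <- unique_name(names(df))  # a name not already in names(df)
df[[new_name]] <- 11:15
```

This is exactly the kind of names() handling the question hopes to avoid, but it is only one line of base R inside the helper.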

Is there a straightforward way to do this in dplyr? Getting it to work in mutate would be ideal, although I suppose bind_cols or tibble::add_column would also be fine.

Some things I have tried that don't work:

library(dplyr)

df <- data.frame(a = 1:5)

# Gives the new variable a fixed title which might already be in there
df %>% mutate(6:10)
df %>% tibble::add_column(6:10)
df %>% mutate(NULL = 6:10)

# Error
df %>% bind_cols(6:10)
df %>% mutate( = 6:10)
df %>% mutate(!!NULL := 6:10)

# And an example of the kind of function I'm looking at:
# This function returns the original data arranged in a random order
# and also the random variable used to arrange it
arrange_random <- function(df) {
  df <- df %>%
    mutate(randomorder = runif(n())) %>%
    arrange(randomorder)

  return(df)
}

# No naming conflict, no problem!
data <- data.frame(a = 1:5)
arrange_random(data)

# Uh-oh, the original data gets lost!
data <- data.frame(randomorder = 1:5)
arrange_random(data)
NickCHK
  • Can you please include a min reprex and include what you have tried and some sample data? https://stackoverflow.com/help/minimal-reproducible-example – kstew Aug 08 '19 at 20:36
  • Can you let the user specify a non-conflicting prefix as an argument to your function, then add simple suffixes e.g. _1, _2, ...? You could have a default if the user doesn't specify that argument, and an error asking the user to re-run with that argument specified if the names conflict. Shouldn't be too confusing for them I'd think. – IceCreamToucan Aug 08 '19 at 20:36
  • @kstew I have added a small example and some failed attempts I have already tried. – NickCHK Aug 08 '19 at 20:40
  • @IceCreamToucan That's an interesting thought. I'd rather not require the user to do the work here, but if I can do that with the names() vector at the top of the function, and just pick the longest variable name from it, that might be a less kludgy way of doing it. – NickCHK Aug 08 '19 at 20:41
  • Nick, you say that you want to add new variables within a function, can you provide more details on the function? – kstew Aug 08 '19 at 20:49
  • @kstew Oh, sorry, I see what you mean. I've added an example function. – NickCHK Aug 08 '19 at 20:55
  • @NickCHK: something like this https://stackoverflow.com/a/48898288/786542? – Tung Aug 08 '19 at 20:56
  • @Tung The concern I have about doing that is I need to work ahead of time to make sure there aren't any conflicts with the modified version. For example if I'm tacking on .1 to the name, and the data has both x and x.1 already in it, I need to be careful to expand on x.1 and not x. So I'm still stuck working with names() – NickCHK Aug 08 '19 at 21:00
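
One possible way around the x / x.1 concern in the last comment, as a minimal sketch rather than a definitive answer: let make.unique() choose the suffix, then pass the resulting string to mutate() with !! and :=, and to arrange() through the .data pronoun. The base argument and the order_col variable are illustrative additions, not part of the original arrange_random().

```r
library(dplyr)

# Hypothetical rewrite of arrange_random() from the question
arrange_random <- function(df, base = "randomorder") {
  # Put `base` last so make.unique() can only rename the new column,
  # never one that is already in the data
  order_col <- tail(make.unique(c(names(df), base)), 1)

  df %>%
    mutate(!!order_col := runif(n())) %>%  # string on the left-hand side of :=
    arrange(.data[[order_col]])            # look the column up by its string name
}

# Both the base name and its first suffixed variant are already taken
data <- data.frame(randomorder = 1:5, randomorder.1 = 6:10)
out <- arrange_random(data)
names(out)
```

The names() bookkeeping is still there, but it shrinks to a single line, and the rest of the function stays in dplyr.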

1 Answer

I am posting this solution for now. This sounds like a case of not knowing one's data very well, so I think one good approach is to include an if-else check in the function. The logic is that the user chooses some arbitrary new name to add as a suffix to their original variable names, and the function returns an error if that name already appears anywhere in the column names of the original data. Otherwise, the function runs and returns the original data plus the newly created columns.

library(dplyr)

df <- data.frame(a = 1:5, b = 11:15, c = 21:25)

# Define a function that stops early if any existing column name contains `name`
addnew <- function(data, name = 'newvar') {
  # grepl() does a case-insensitive substring match, so this check is conservative
  if (any(grepl(name, names(data), ignore.case = TRUE))) {
    stop('Error! Possible duplicate names with your new variable names')
  }
  # Replace every column with random draws, then suffix the names with `name`
  data1 <- data %>% mutate_all(list(~ runif(n())))
  names(data1) <- paste0(names(data1), '_', name)
  bind_cols(data, data1)
}

addnew(df,'new')

  a  b  c     a_new     b_new     c_new
1 1 11 21 0.2875775 0.0455565 0.9568333
2 2 12 22 0.7883051 0.5281055 0.4533342
3 3 13 23 0.4089769 0.8924190 0.6775706
4 4 14 24 0.8830174 0.5514350 0.5726334
5 5 15 25 0.9404673 0.4566147 0.1029247

# try with new data that should throw an error
df <- data.frame(a_new = 1:5,b=11:15,c=21:25)

addnew(df,'new')
Error in addnew(df, "new") : 
  Error! Possible duplicate names with your new variable names
kstew