2

I am struggling to pass column names inside my custom function while using dplyr - mutate_at. I have a dataset "dt" with thousands of columns and I would like to perform mutate for some of these columns, but in a way which is dependent on the column name

I have this piece of code

Option 1:

relevantcols = c("A", "B", "C")
myfunc <- function(colname, x) {
   #write different logic per column name
}
dt%>%
  mutate_at(relevantcols, funs(myfunc(<what should i give?>,.)))

I tried approaching the problem in another way, i.e by iterating over relevantcols and applying mutate_at for each of the elements of the vector as follows

Option 2:

for (i in 1:length(relevantcols)){
  dt%>%
  mutate_at(relevantcols[i], funs(myfunc(relevantcols[i], .))
}

I get the colnames in Option 2, but it is 10 times slower than Option 1. Can I get somehow the column names in Option 1?

Adding an example for more clarity

df = data.frame(employee=seq(1:5), Mon_channelA=runif(5,1,10), Mon_channelB=runif(5,1,10), Tue_channelA=runif(5,1,10),Tue_channelB=runif(5,1,10))
df
 employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
1        1     5.234383     6.857227     4.480943     7.233947
2        2     7.441399     3.777524     2.134075     6.310293
3        3     7.686558     8.598688     9.814882     9.192952
4        4     6.033345     5.658716     5.167388     3.018563
5        5     5.595006     7.582548     9.302917     6.071108
relevantcols = c("Mon_channelA", "Mon_channelB")
myfunc <- function(colname, x) {
#based on the channel and weekday, compare the data from corresponding column with  the same channel but different weekday and return T if higher else F
}
# required output
employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
1        1     T     F     4.480943     7.233947
2        2     T     F     2.134075     6.310293
3        3     F     F     9.814882     9.192952
4        4     T     T     5.167388     3.018563
5        5     F     T     9.302917     6.071108
James Z
  • 12,209
  • 10
  • 24
  • 44
  • Could you give a concrete example of the "different logic per column name", so we have an example function to work with? It will also help to have an example dataset along with your desired output. You can see ideas for how to make a reproducible example [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – aosmith Sep 28 '18 at 14:02
  • Done, hope it helps in understanding the situation – Tanmoy Chatterjee Sep 28 '18 at 14:25
  • I'm unclear on part of the logic of the output you want. For Monday you want true/false values, but for Tuesday you want the values corresponding to Tuesdays? Why not all true/false or all numeric? – camille Sep 28 '18 at 14:56
  • @camille you are right. In the end all values will be boolean or numeric. so tuesday's values will have to be compared with wednesdays with some exceptions on weekends and so on....i just wanted to explain the point of having the requirement of passing the column name into myfunc...note that in the original dataset, I have more than 5000 columns with various combinations of channels and weekdays. – Tanmoy Chatterjee Sep 28 '18 at 15:08

3 Answers3

1

I left a comment about data types, but assuming that that is what you're looking for, here's the approach I take to these sorts of problems. I do this in a seemingly convoluted process of reshaping a few times, but it lets you set up the variables that you're trying to compare without hard-coding much. I'll break it into pieces.

library(tidyverse)

set.seed(928)
df <- data.frame(employee=seq(1:5), Mon_channelA=runif(5,1,10), Mon_channelB=runif(5,1,10), Tue_channelA=runif(5,1,10),Tue_channelB=runif(5,1,10))

First, I'd reshape it into a long shape and break the "Mon_channelA", etc apart into a day and a channel. This lets you use the channel designation to match values for comparison.

df %>%
  gather(key, value, -employee) %>%
  separate(key, into = c("day", "channel"), sep = "_") %>%
  head()
#>   employee day  channel    value
#> 1        1 Mon channelA 2.039619
#> 2        2 Mon channelA 8.153684
#> 3        3 Mon channelA 9.027932
#> 4        4 Mon channelA 1.161967
#> 5        5 Mon channelA 3.583353
#> 6        1 Mon channelB 7.102797

Then, bring it back into a wide format based on the days. Now you have a column for each day for each combination of employee and channel.

df %>%
  gather(key, value, -employee) %>%
  separate(key, into = c("day", "channel"), sep = "_") %>%
  spread(key = day, value = value) %>%
  head()
#>   employee  channel      Mon      Tue
#> 1        1 channelA 2.039619 9.826677
#> 2        1 channelB 7.102797 7.388568
#> 3        2 channelA 8.153684 5.848375
#> 4        2 channelB 6.299178 9.452274
#> 5        3 channelA 9.027932 5.458906
#> 6        3 channelB 7.029408 7.087011

Then do your comparison, and take the data long again. Note that because the value column has numeric values, everything becomes numeric and the logical values are converted to 1 or 0.

df %>%
  gather(key, value, -employee) %>%
  separate(key, into = c("day", "channel"), sep = "_") %>%
  spread(key = day, value = value) %>%
  mutate(Mon = Mon > Tue) %>%
  gather(key = day, value = value, Mon, Tue) %>%
  head()
#>   employee  channel day value
#> 1        1 channelA Mon     0
#> 2        1 channelB Mon     0
#> 3        2 channelA Mon     1
#> 4        2 channelB Mon     0
#> 5        3 channelA Mon     1
#> 6        3 channelB Mon     0

Last few steps are to stick the day and channel back together to make the labels as you had them, spread back to a wide format, and turn all the columns starting with "Mon" back into logicals.

df %>%
  gather(key, value, -employee) %>%
  separate(key, into = c("day", "channel"), sep = "_") %>%
  spread(key = day, value = value) %>%
  mutate(Mon = Mon > Tue) %>%
  gather(key = day, value = value, Mon, Tue) %>%
  unite("variable", day, channel) %>%
  spread(key = variable, value = value) %>%
  mutate_at(vars(starts_with("Mon")), as.logical)
#>   employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
#> 1        1        FALSE        FALSE     9.826677     7.388568
#> 2        2         TRUE        FALSE     5.848375     9.452274
#> 3        3         TRUE        FALSE     5.458906     7.087011
#> 4        4        FALSE        FALSE     8.854263     8.946458
#> 5        5        FALSE        FALSE     6.933054     8.450741

Created on 2018-09-28 by the reprex package (v0.2.1)

camille
  • 16,432
  • 18
  • 38
  • 60
  • thanks, will try out your approach; the only concern I had was performance in doing the gather was that I could end up having hundreds of millions of rows and was not sure on how subsequent transformations would be impacted. but nevertheless will try out and get back with the result. – Tanmoy Chatterjee Sep 28 '18 at 15:37
0

You can do things like :

L <- c("A","B")
df <- data.frame(A=rep(1:3,2),B=1:6,C=7:12)
df
#  A B  C
#1 1 1  7
#2 2 2  8
#3 3 3  9
#4 1 4 10
#5 2 5 11
#6 3 6 12

f <- function(x,y) x^y

df %>% mutate_at(L,funs(f(.,2)))
#  A  B  C
#1 1  1  7
#2 4  4  8
#3 9  9  9
#4 1 16 10
#5 4 25 11
#6 9 36 12
Nicolas2
  • 2,170
  • 1
  • 6
  • 15
0

This is an old question, but I just stumbled over one possible way to solve it using a custom mutate/case_when function in combination with purrr::reduce.

It's important to use non-standard evaluation (NSE) inside the mutate/case_when statement to match the variable names you need for your custom function.

I do not know a way to do something similar with mutate_at.

Below I provide two examples, the most basic form (using your original data), and a more advanced version (which contains three weekdays and two channels and) which creates more than two variables. The latter requires an initial set-up using, for example, switch.

Basic example

library(tidyverse)

# your data
df <- data.frame(employee=seq(1:5),
                Mon_channelA=runif(5,1,10),
                Mon_channelB=runif(5,1,10),
                Tue_channelA=runif(5,1,10),
                Tue_channelB=runif(5,1,10)
                )

# custom function which takes two arguments, df and a string variable name
myfunc <- function(df, x) {

  mutate(df,
         # overwrites all "Mon_channel" variables ...
         !! paste0("Mon_", x) := case_when(  
           # ... with TRUE, when Mon_channel is smaller than Tue_channel, and FALSE else
            !! sym(paste0("Mon_", x)) < !! sym(paste0("Tue_", x))  ~ T,
            T ~ F
           )
         )
}

# define the variables you want to loop over
var_ls <- c("channelA", "channelB")

# use var_ls and myfunc with reduce on your data
df %>% 
  reduce(var_ls, myfunc, .init = .)
#>   employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
#> 1        1        FALSE        FALSE     3.437975     2.458389
#> 2        2        FALSE         TRUE     3.686903     4.772390
#> 3        3         TRUE         TRUE     5.158234     5.378021
#> 4        4         TRUE         TRUE     5.338950     3.109760
#> 5        5         TRUE        FALSE     6.365173     3.450495

Created on 2020-02-03 by the reprex package (v0.3.0)

More advanced example

library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 3.5.2
#> Warning: package 'purrr' was built under R version 3.5.2
#> Warning: package 'forcats' was built under R version 3.5.2

# your data plus one weekday with two channels
df <- data.frame(employee=seq(1:5),
                Mon_channelA=runif(5,1,10),
                Mon_channelB=runif(5,1,10),
                Tue_channelA=runif(5,1,10),
                Tue_channelB=runif(5,1,10),
                Wed_channelA=runif(5,1,10),
                Wed_channelB=runif(5,1,10)
                )

# custom function which takes two argument, df and a string variable name
myfunc <- function(df, x) {

  # an initial set-up is needed

  # id gets the original day
  id <- str_extract(x, "^\\w{3}")

  # based on id the day of comparison is mapped with switch
  y <- switch(id,
              "Mon" = "Tue",
              "Tue" = "Wed")

  # j extracts the channel name including the underscore
  j <- str_extract(x, "_channel[A-Z]{1}")

  # this makes the function definition rather easy:
  mutate(df,
         !! x := case_when(  
            !! sym(x) < !! sym(paste0(y, j))  ~ T,
            T ~ F
           )
         )
}

# define the variables you want to loop over
var_ls <- c("Mon_channelA",
            "Mon_channelB",
            "Tue_channelA",
            "Tue_channelB")

# use var_ls and myfunc with reduce on your data
df %>% 
  reduce(var_ls, myfunc, .init = .)
#>   employee Mon_channelA Mon_channelB Tue_channelA Tue_channelB
#> 1        1         TRUE         TRUE         TRUE        FALSE
#> 2        2        FALSE         TRUE         TRUE        FALSE
#> 3        3        FALSE         TRUE        FALSE         TRUE
#> 4        4        FALSE         TRUE         TRUE        FALSE
#> 5        5         TRUE        FALSE        FALSE        FALSE
#>   Wed_channelA Wed_channelB
#> 1     9.952454     5.634686
#> 2     9.356577     4.514683
#> 3     2.721330     7.107316
#> 4     4.410240     2.740289
#> 5     5.394057     4.772162

Created on 2020-02-03 by the reprex package (v0.3.0)

TimTeaFan
  • 17,549
  • 4
  • 18
  • 39