8

I frequently have to calculate new variables from existing ones in a data frame based on a condition of a factor variable.

Edit Getting 4 answers in like 2 mins, I realised I have oversimplified my example. Please see below.

Simple example:

df <- data.frame(value=c(1:5),class=letters[1:5])
df
value class
1     a
2     b
3     c
4     d
5     e

I can use such code

df %>% 
    mutate(result=NA) %>%
    mutate(result=ifelse(class=="a",value*1,result)) %>%
    mutate(result=ifelse(class=="b",value*2,result)) %>%
    mutate(result=ifelse(class=="c",value*3,result)) %>%
    mutate(result=ifelse(class=="d",value*4,result)) %>%
    mutate(result=ifelse(class=="e",value*5,result))

to perform conditional calculations on my variables, resulting in

value class result
 1     a      1
 2     b      4
 3     c      9
 4     d     16
 5     e     25

As in reality the number of classes is larger and the calculations are more complex, however, I would prefer something cleaner, like this

df %>%
mutate(results=switch(levels(class),
                    "a"=value*1,
                    "b"=value*2,
                    "c"=value*3,
                    "d"=value*4,
                    "e"=value*5))

which obviously doesn't work

Error in switch(levels(1:5), a = 1:5 * 1, b = 1:5 * 2, c = 1:5 * 3, d =  1:5 *  : 
  EXPR must be a length 1 vector

Is there a way I can do this more nicely with dplyr piping (or else)?

Edit In reality, I have more value variables to include in my calculations and they are not simple consecutive vectors, they are thousands of rows of measured data.

Here my simple example with a second random value variable (again, it's more in my real data)

df <- data.frame(value1=c(1:5),value2=c(2.3,3.6,7.2,5.6,0),class=letters[1:5])
value1 value2 class
  1    2.3     a
  2    3.6     b
  3    7.2     c
  4    5.6     d
  5    0.0     e

and my calculations are different for every condition. I understand I can simplify somewhat like this

df %>% 
mutate(result=NA,
     result=ifelse(class=="a",value1*1,result),
     result=ifelse(class=="b",value1/value2*4,result),
     result=ifelse(class=="c",value2*3.57,result),
     result=ifelse(class=="d",value1+value2*2,result),
     result=ifelse(class=="e",value2/value1/5,result))

A working solution similar to the above switch example would be even cleaner, though.

user3460194
  • 511
  • 1
  • 4
  • 14
  • 2
    omg, you are so fast... And I have given a poor simplification of my real problem. Give me some time for an edit please, before posting new solutions... – user3460194 Jun 17 '15 at 16:44
  • There is no need for multiple mutates, you can reference to "fresh" varibale within the same mutate, i.e.: `mutate(result=NA, result=ifelse(class=="a",value*1,result)), result= ... )` – zx8754 Jun 17 '15 at 16:49
  • Thanks, I included this in my edit. Some improvement, I agree, but can it be even cleaner? – user3460194 Jun 17 '15 at 17:00
  • Heh after this edit, non of the solutions would work... – zx8754 Jun 17 '15 at 17:01
  • I know, Sorry, I am not really familiar yet with converting my real problems to simple examples for easy posting here... – user3460194 Jun 17 '15 at 17:03
  • 1
    more or less same as [this question](http://stackoverflow.com/q/19054723/817778). – eddi Jun 17 '15 at 17:11
  • @user3460194 you could used one nested `ifelse`. – zx8754 Jun 17 '15 at 17:18
  • @eddi hm, no, I don't think so. I don't aggregate, I calculate row-wise. – user3460194 Jun 17 '15 at 17:22
  • @zx8754 True, but these would become a real hassle to read, I am afraid. – user3460194 Jun 17 '15 at 17:23
  • Hm, it seems the updated ifelse solution is the cleanest it can get for my special case. I am actually a bit surprised that something like my (non-working) switch solution does not exist. I thought it was just my ignorance... Isn't this a quite common scenario: dynamically filter a dataset and apply different calculations to each data subset using different other variables of the dataset? – user3460194 Jun 18 '15 at 07:57

4 Answers4

10

No need to use ifelse here, You can use merge:

df <- data.frame(value=c(1:5),class=letters[1:5])
cond <- data.frame(ratio=c(1:5),class=letters[1:5])
transform(merge(df,cond),result=value*ratio)

  class value ratio result
1     a     1     1      1
2     b     2     2      4
3     c     3     3      9
4     d     4     4     16
5     e     5     5     25

After OP edit

It looks that the OP wants to apply a different function for each class. Here a data.table solution. I think it is simple and readable. First, I create function for each factor:

## here each function takes a data.table as an single argument
fns <- list(
  function(x) x[,value1]*1,
  function(x) x[,value1]/x[,value2]*4,
  function(x) x[,value2]*3.57,
  function(x) x[,value1]+x[,value2]*2,
  function(x) x[,value2]/x[,value1]/5
)
## create a names list here 
## the names here are just the class factors
fns <- setNames(fns,letters[1:5])

Applying the function by class is straightforward. I create the function name , and I use do.call to call a function by its name

## using data.table here for grouping feature
## .SD is the rest of columns except the grouping variable
## the code can also be written in dplyr or in base-R
library(data.table)
setDT(df)[,value:= fns[[class]](.SD),by=class]

     value1 value2 class     value
 1:      1    2.3     a  1.000000
 2:      2    3.6     b  2.222222
 3:      3    7.2     c 25.704000
 4:      4    5.6     d 15.200000
 5:      5    0.0     e  0.000000
 6:      1    2.3     a  1.000000
 7:      2    3.6     b  2.222222
 8:      3    7.2     c 25.704000
 9:      4    5.6     d 15.200000
10:      5    0.0     e  0.000000

I use this df:

df <- data.frame(value1=c(1:5),value2=c(2.3,3.6,7.2,5.6,0),
                 class=rep(letters[1:5],2))
agstudy
  • 119,832
  • 17
  • 199
  • 261
  • Thanks, but see my edit. It's not just a simple ratio, the calculations are different for every condition. – user3460194 Jun 17 '15 at 17:01
  • Thanks for your update. This surely is a readable solution for someone familiar with the data.table syntax, which I am not, however... Also, defining the functions in this way does work well in this example. In my real problem, however, the calculations to perform use ~20 different variables, while each class only needs a subset of them. As far as I understand in your solution I would have to create functions for all of my ~30 classes with all 20 variables included, even if most of the classes only need 4-5 of them. Anyway, that's again my fault as my example is still too simple. I apologize. – user3460194 Jun 18 '15 at 07:48
  • @user3460194 you can see my edit. Now the functions takes a data.table as input and choose the variable internally. The more you explain your problem , you best you get a solution. I hope it is ok this time. – agstudy Jun 18 '15 at 08:14
  • Ok, Thank you, this is interesting. Even though I will probably stick with my ifelse solution for now, I've definitely learned something here! And next time I will try and do better in explaining... – user3460194 Jun 18 '15 at 08:38
3

As I mentioned in the comments, this question is more or less the same as this one (and you should read the answer there to catch up on what's going on below):

library(data.table)
dt = as.data.table(df) # or setDT to convert in place
dt[, class := as.character(class)] # simpler

# create a data.table with *functions* to match each class
fns = data.table(cls = letters[1:5], fn = list(quote(value1*1), quote(value1/value2*4), quote(value2*3.57), quote(value1+value2*2), quote(value2/value1/5)), key = 'cls')

# I have to jump through hoops here, due to a bug or two, see below
setkey(dt, class)
newvals = dt[, eval(fns[class]$fn[[1]], .SD), by = class]$V1
dt[, result := newvals][]
#   value1 value2 class    result
#1:      1    2.3     a  1.000000
#2:      2    3.6     b  2.222222
#3:      3    7.2     c 25.704000
#4:      4    5.6     d 15.200000
#5:      5    0.0     e  0.000000

Due to a few bugs in data.table the following, straightforward versions don't work yet:

dt[, result := eval(fns[class]$fn[[1]], .SD), by = class]

# or even better
dt[fns, result := eval(fn[[1]], .SD), by = .EACHI]

Bug reports have been filed.


I'm adding the suggestion in the comments from Frank below, as I think it's pretty cool and this way it's more likely to be preserved in SO. A more readable way of creating the function table is as follows:

quotem <- function(...) as.list(sys.call())[-1]

fnslist <- quotem(a = value1*1,
                  b = value1/value2*4,
                  c = value2*3.57,
                  d = value1+value2*2,
                  e = value2/value1/5)

fns = data.table(cls=names(fnslist),fn=fnslist,key="cls")
Community
  • 1
  • 1
eddi
  • 49,088
  • 6
  • 104
  • 155
  • 1
    Here's a more readable/writable way of constructing `fns`: `quotem <- function(...) as.list(sys.call())[-1]; fnslist <- quotem(a = value1*1, b = value1/value2*4, c = value2*3.57, d = value1+value2*2, e = value2/value1/5); fns = data.table(cls=names(fnslist),fn=fnslist,key="cls")` – Frank Jun 17 '15 at 18:32
  • @eddi OK, you are right. I am not familiar at all with data.table and the dplyr solution in the referenced question seemed to be based on summarise which is an aggregation function. Anyway, Thanks for your help, but I guess I will have to stay with my dplyr ifelse solutions – user3460194 Jun 18 '15 at 07:51
  • @user3460194 the `dplyr` solution there indeed can't do what the `data.table` one does (and I don't actually know if a function matching solution like this one is possible with `dplyr`). `ifelse` is quite slow and inefficient, but if it fits your needs, then ok. – eddi Jun 18 '15 at 14:41
2

Similar idea using dplyr and @agstudy example:

library(dplyr)

df %>% 
  left_join(cond) %>% 
  mutate(result = value * ratio)

Which gives:

#  value class ratio result
#1     1     a     1      1
#2     2     b     2      4
#3     3     c     3      9
#4     4     d     4     16
#5     5     e     5     25
Steven Beaupré
  • 21,343
  • 7
  • 57
  • 77
2

Here's the dplyr/lazyeval implementation of @agstudy's answer:

# required packages
require(lazyeval)
require(dplyr)
# data (from @agstudy)
df <- data.frame(value1=c(1:5),value2=c(2.3,3.6,7.2,5.6,0),
                 class=rep(letters[1:5],2))
# functions (lazy instead of functions)
fns <- list(
  a = lazy(x*1), 
  b = lazy(x/y*4), 
  c = lazy(y*3.57),
  d = lazy(x+y*2),
  e = lazy(y/x/5)
)
# mutate call
df %>% 
  group_by(class) %>%
  mutate(value = lazy_eval(fns[class][[1]], 
                           list(x = value1, y = value2)))
shadow
  • 21,823
  • 4
  • 63
  • 77
  • Thank you, interesting! Do I need to name the variables x,y,z... and them pass them as a list to the lazy_eval function? Or is there a way to use the real names of the variables directly in the fns list? – user3460194 Jun 19 '15 at 07:14