34

I'm new to R and I'm trying to sum 2 columns of a given dataframe, if both the elements to be summed satisfy a given condition. To make things clear, what I want to do is:

> t.d<-as.data.frame(matrix(1:9,ncol=3))
> t.d
  V1 V2 V3
  1  4  7  
  2  5  8  
  3  6  9  

> t.d$V4<-rep(0,nrow(t.d))

> for (i in 1:nrow(t.d)){
+   if (t.d$V1[i]>1 && t.d$V3[i]<9){
+     t.d$V4[i]<-t.d$V1[i]+t.d$V3[i]}
+     }

> t.d    
  V1 V2 V3 V4
  1  4  7  0
  2  5  8 10
  3  6  9  0

I need efficient code, as my real dataframe has about 150000 rows and 200 columns. This gives an error:

t.d$V4<-t.d$V1[t.d$V1>1]+ t.d$V3[t.d$V3>9] 

Is "apply" an option? I tried this:

t.d<-as.data.frame(matrix(1:9,ncol=3))
t.d$V4<-rep(0,nrow(t.d))

my.fun<-function(x,y){
  if(x>1 && y<9){
    x+y}
}

t.d$V4<-apply(X=t.d,MAR=1,FUN=my.fun,x=t.d$V1,y=t.d$V3)

but it gives an error as well. Thanks very much for your help.

Gavin Simpson
Elinka

3 Answers

43

This operation doesn't require loops, apply statements or if statements. Vectorised operations and subsetting are all you need:

t.d <- within(t.d, V4 <- V1 + V3)
t.d[!(t.d$V1>1 & t.d$V3<9), "V4"] <- 0
t.d

  V1 V2 V3 V4
1  1  4  7  0
2  2  5  8 10
3  3  6  9  0

Why does this work?

In the first step I create a new column that is the straight sum of columns V1 and V3. I use within as a convenient way of referring to the columns of t.d without having to write t.d$ all the time.

In the second step I subset all of the rows that don't fulfill your conditions and set V4 for these to 0.
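To make the two steps easier to follow, here is a minimal sketch that stores the condition in its own variable before using it. The name `keep` is made up for illustration and does not appear in the code above. Note that the whole-column test uses the element-wise `&`; the `&&` in the original loop only suits single values.

```r
# Same sample data as in the question
t.d <- as.data.frame(matrix(1:9, ncol = 3))

# Step 1: unconditional sum of V1 and V3
t.d$V4 <- t.d$V1 + t.d$V3

# Step 2: logical vector, TRUE where both conditions hold
# ('keep' is a hypothetical helper name)
keep <- t.d$V1 > 1 & t.d$V3 < 9
print(keep)

# Rows that fail the test get 0
t.d$V4[!keep] <- 0
print(t.d)
```

Printing `keep` shows which rows pass the test, which can help when debugging more complicated conditions.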

Andrie
  • 1
    Thank you! So simple and yet perfect. I can't believe I spent half a day thinking about this problem. – Elinka Jun 29 '11 at 11:24
  • 2
    If it makes you feel better, this type of problem made my head go flat when I started working with R. :-) – Andrie Jun 29 '11 at 11:28
25

ifelse is your friend here:

t.d$V4<-ifelse((t.d$V1>1)&(t.d$V3<9), t.d$V1+ t.d$V3, 0)
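As a quick check, the sketch below runs that line on the sample data from the question. It also shows, on a small made-up vector `x` (not part of the question), that ifelse returns NA wherever the test itself is NA, which may matter if the real data has missing values.

```r
# Sample data from the question
t.d <- as.data.frame(matrix(1:9, ncol = 3))
t.d$V4 <- ifelse((t.d$V1 > 1) & (t.d$V3 < 9), t.d$V1 + t.d$V3, 0)
print(t.d$V4)

# Hypothetical vector with a missing value: the NA in the test
# is carried through to the result instead of becoming 0
x <- c(1, NA, 3)
res <- ifelse(x > 1, x, 0)
print(res)
```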
Roman Luštrik
Nick Sabbe
10

I'll chip in and provide yet another version. Since you want zero if the condition doesn't match, and TRUE/FALSE are glorified versions of 1/0, simply multiplying by the condition also works:

t.d<-as.data.frame(matrix(1:9,ncol=3))
t.d <- within(t.d, V4 <- (V1+V3)*(V1>1 & V3<9))

...and it happens to be faster than the other solutions ;-)

t.d <- data.frame(V1=runif(2e7, 1, 2), V2=1:2e7, V3=runif(2e7, 5, 10))
system.time( within(t.d, V4 <- (V1+V3)*(V1>1 & V3<9)) )         # 3.06 seconds
system.time( ifelse((t.d$V1>1)&(t.d$V3<9), t.d$V1+ t.d$V3, 0) ) # 5.08 seconds
system.time( { t.d <- within(t.d, V4 <- V1 + V3); 
               t.d[!(t.d$V1>1 & t.d$V3<9), "V4"] <- 0 } )       # 4.50 seconds
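Timings like these depend on the machine and the R version, so it may be worth rerunning them on your own data. Before comparing speed, it also helps to confirm that the three approaches agree. The sketch below does that on a much smaller data frame; the names `d`, `n`, `cond`, `v_mult`, `v_ifelse` and `v_subset`, and the seed, are chosen only for illustration.

```r
set.seed(1)   # arbitrary seed, only so the run is repeatable
n <- 1e5      # far fewer rows than the 2e7 used in the timings
d <- data.frame(V1 = runif(n, 1, 2), V2 = seq_len(n), V3 = runif(n, 5, 10))

cond <- d$V1 > 1 & d$V3 < 9

# Multiply by the condition (TRUE counts as 1, FALSE as 0)
v_mult <- (d$V1 + d$V3) * cond

# ifelse version
v_ifelse <- ifelse(cond, d$V1 + d$V3, 0)

# Sum first, then zero out the rows that fail the test
v_subset <- d$V1 + d$V3
v_subset[!cond] <- 0

# Both comparisons should report TRUE
print(all.equal(v_mult, v_ifelse))
print(all.equal(v_mult, v_subset))
```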
Tommy