
First, I am very new to R, and I'm aware that I may be making an obvious mistake. I have searched for an answer, but maybe I'm searching for the wrong thing.

I am trying to apply a function to add a new column to a dataframe based on the contents of that row. But it looks to me like the values in the row are not being handled properly in the mutate function when using rowwise. I've tried to create a toy example to demonstrate my problem.

library(dplyr)    
x<-c("A","B")
y<-c(1,2)
df<-data.frame(x,y)

Then I have a function to create a new column called z which adds 1 to y if the value of x is "A" and adds 2 to y if the value of x is "B". Note that I have added print(x) to show what is going on.

calculatez <- function(x,y){
  print(x)
  if(x == "A"){
    return (y+1)
  } 
  else{
    return(y+2)
  } 
}

I then try to use mutate:

df %>%
  rowwise() %>%
  mutate(z = calculatez(x,y))

and I get the following: 2 has been added to both rows, rather than 1 to the first row, and the "A" and "B" have been passed into the function as 1 and 2.

[1] 1
[1] 2
Source: local data frame [2 x 3]
Groups: 

  x y z
1 A 1 3
2 B 2 4

If I remove the rowwise() call, "A" and "B" appear to be passed properly, but clearly I don't get the right result.

df %>%
  mutate(z = calculatez(x,y))

[1] A B
Levels: A B
  x y z
1 A 1 2
2 B 2 3
Warning message:
In if (x == "A") { :
  the condition has length > 1 and only the first element will be used

I can get it to work if I do it without writing my own function, and then I don't get the warning about the length of the condition. So I don't think I properly understand what rowwise() is doing.

df %>%
  mutate(z = ifelse(x=="A",y+1,y+2))

  x y z
1 A 1 2
2 B 2 4

But I want to be able to use my own function, because in my real application the condition is more complicated and it will be difficult to read with lots of nested ifelse functions in the mutate function.

I can get round the problem by changing my condition to if(x==1) but that will make my code difficult to understand.

I don't want to waste your time, so sorry if I'm missing something obvious. Any tips on where I'm going wrong?

tecb1234

1 Answer


You could use rowwise with do

 df %>% 
 rowwise() %>% 
 do(data.frame(., z= calculatez(.$x, .$y)))

gives the output

     x y z
  #1 A 1 2
  #2 B 2 4

Or you could do:

  df %>%
  group_by(N=row_number()) %>% 
  mutate(z=calculatez(x,y))%>% 
  ungroup() %>%
  select(-N)

Using a different dataset:

df <- structure(list(x = structure(c(1L, 1L, 2L, 2L, 2L), .Label = c("A", 
"B"), class = "factor"), y = c(1, 2, 1, 2, 1)), .Names = c("x", 
"y"), row.names = c(NA, -5L), class = "data.frame")

Running the above code gives:

 #  x y z
 #1 A 1 2
 #2 A 2 3
 #3 B 1 3
 #4 B 2 4
 #5 B 1 3

If you are using data.table

library(data.table)
setDT(df)[, z := calculatez(x,y), by=seq_len(nrow(df))]
df
#    x y z
# 1: A 1 2
# 2: A 2 3
# 3: B 1 3
# 4: B 2 4
# 5: B 1 3
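
Another option, offered as a sketch rather than as part of the original answer, is to make the function itself vectorized so that no row-wise grouping is needed. The name calculatez_vec is hypothetical, and the example uses base R only:

```r
# Hypothetical vectorized variant of calculatez (the name is an assumption).
calculatez_vec <- function(x, y) {
  # Compare labels rather than factor codes, in case x is a factor.
  x <- as.character(x)
  # ifelse() evaluates the condition element by element.
  ifelse(x == "A", y + 1, y + 2)
}

# Same values as the five-row dataset above.
df <- data.frame(x = c("A", "A", "B", "B", "B"), y = c(1, 2, 1, 2, 1))
df$z <- calculatez_vec(df$x, df$y)
df
```

Inside a dplyr pipeline the same function could be called as df %>% mutate(z = calculatez_vec(x, y)), without rowwise().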
Arun
akrun
  • Thanks, that's great. I now understand how to get what I want. I haven't come across `do` before, so I'll read into that. But I guess I thought that `rowwise` did the equivalent of `group_by(N=row_number())`. For my general understanding, any ideas why my first attempt didn't work? – tecb1234 Sep 07 '14 at 13:32
  • @tecb1234 According to the help page of ?rowwise() `‘rowwise’ is used for the results of ‘do’ when you create list-variables.` For some reason, when you combine `mutate` with rowwise(), only the `else` branch of `calculatez` got executed. – akrun Sep 07 '14 at 13:50
  • Yep, because for some reason, the value of `x` gets passed as `1` or `2` not `"A"` or `"B"`, so therefore my `if(x=="A")` condition is always false. But thanks for pointing me at the help page, it seems like `rowwise` is only meant to work with `do`. Good to know. – tecb1234 Sep 07 '14 at 14:20
  • I've now tested both the `rowwise() %>% do(...` and the `group_by(N=row_number()) %>% mutate(...` solutions on my real dataset (about 54k rows). The `mutate` solution takes a couple of seconds to run. The `do` solution takes about 3 minutes! I guess this is because `mutate` is calling C++ code and `do` is calling R code? – tecb1234 Sep 07 '14 at 16:12
  • @tecb1234 That's new info for me. Thanks. You may have to look at the source code. Have you tested `data.table` solution? I guess it would be faster than the other methods. – akrun Sep 07 '14 at 16:15
  • No idea where to start with the source code! The `do` solution takes ~170s, the `mutate` solution takes ~2.2s, and the `data.table` solution takes ~2.0s. As you predicted, the `data.table` solution is the fastest. – tecb1234 Sep 07 '14 at 17:04
  • @tecb1234 Please check this link `http://stackoverflow.com/questions/19226816/how-can-i-view-the-source-code-for-a-function` – akrun Sep 07 '14 at 18:03
  • akrun, @tecb1234, simplified the `data.table` solution. `by` understands expressions. – Arun Sep 07 '14 at 20:19
  • @akrun I'd say it's a bug in rowwise() + mutate(). – hadley Sep 11 '14 at 00:00
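
The symptom described in the comments, where `x` arrives as `1` or `2`, is consistent with a factor's underlying integer codes being passed instead of its labels. The sketch below only illustrates that distinction in base R; it does not reproduce the dplyr behaviour itself, and the variable names are illustrative.

```r
# A factor stores integer codes plus a set of labels (levels).
x <- factor(c("A", "B"))

codes  <- as.integer(x)    # the underlying codes
labels <- as.character(x)  # the labels the condition was meant to test

# Comparing the codes with "A" never matches, so an if/else
# on this condition would always take the else branch.
codes == "A"

# Comparing the labels gives the intended result.
labels == "A"
```

In R versions before 4.0.0, data.frame() converted character vectors to factors by default, which is why `x` was a factor in the question's toy example.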