3

I'm wondering if there's a dplyr equivalent to

df <- data.frame(A=1:5,B=2:6,C=-1:3)
df[df==2] <- 10

I'm looking for

df %>% <??>

That is, a statement that is chainable with other dplyr verbs

CPak
  • Not that I'm aware of, but did you look for it? This seems to have been asked before: https://stackoverflow.com/questions/34096162/dplyr-mutate-replace-on-a-subset-of-rows or https://stackoverflow.com/questions/23078891/how-to-update-values-with-dplyr – David Arenburg Jul 16 '17 at 20:53
  • @DavidArenburg I always try to look; doesn't mean I'm always successful. Thanks for the links... Could be the answer is a soft *NO*, considering the first link wrote a function to do this. The second link only mutates a single column but looks like a promising start... – CPak Jul 16 '17 at 20:56
  • I'm not up-to-date with `dplyr`, though. It seems to evolve daily. Not to mention it is part of `tidyverse`, so you would probably need to go through about 1K functions to be sure. Also, maybe `magrittr` has something to offer. Regarding the second link, it's just the standard base `ifelse`; similarly, the first link uses `\`[<-.data.frame\`` and `replace` - all base R stuff under the hood (with no reason, IMO). – David Arenburg Jul 16 '17 at 20:58
  • `df %>% mutate_all(funs(ifelse(. == 2, 10, .)))` – www Jul 16 '17 at 21:03
  • Yes, thanks, if you post as an answer I'll accept it. – CPak Jul 16 '17 at 21:04
  • For me, between base R and advanced R, I will pick base R. – BENY Jul 16 '17 at 21:09
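
A side note: the `mutate_all(funs(...))` pattern suggested in the comments above still works, but `funs()` is deprecated and the `_all` verbs are superseded in newer dplyr releases. A minimal sketch of the same column-wise replacement, assuming dplyr >= 1.0 where `across()` is available:

library(dplyr)

df <- data.frame(A = 1:5, B = 2:6, C = -1:3)
# Apply the same replacement to every column via across()
df %>% mutate(across(everything(), ~ ifelse(.x == 2, 10, .x)))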

2 Answers

7

1) replace Try this. It only requires magrittr, although dplyr imports the relevant part of magrittr, so it will work with dplyr too:

df %>% replace(. == 2, 10)

giving:

   A  B  C
1  1 10 -1
2 10  3  0
3  3  4  1
4  4  5 10
5  5  6  3
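
For clarity, this is just base R's replace() applied to the piped data frame; the dot stands for df, so the pipeline above is equivalent to the plain call below (as also noted in the comments under this answer):

replace(df, df == 2, 10)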

1a) overwriting Note that the above is non-destructive, so if you want to update df you will need to assign it back:

df <- df %>% replace(. == 2, 10)

or

df %>% replace(. == 2, 10) -> df

or use the magrittr %<>% operator which eliminates referencing df twice:

df %<>% replace(. == 2, 10)
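
One caveat worth keeping in mind: dplyr re-exports %>% but not %<>%, so this last variant needs magrittr attached explicitly, for example:

library(magrittr)  # provides %<>%, which is not re-exported by dplyr
df %<>% replace(. == 2, 10)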

2) arithmetic This would also work:

df %>% { 10 * (. == 2) + . * (. != 2) }
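
The arithmetic works because comparing the data frame against a scalar yields TRUE/FALSE values that are coerced to 1/0 in arithmetic, so cells equal to 2 contribute 10 and every other cell keeps its original value. A small breakdown of the expression, written against df directly (my own illustration, not part of the answer):

df == 2                          # TRUE exactly where a cell equals 2
10 * (df == 2)                   # 10 in those cells, 0 elsewhere
df * (df != 2)                   # original values elsewhere, 0 in those cells
10 * (df == 2) + df * (df != 2)  # same result as the pipeline above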
G. Grothendieck
  • So it's just a base R solution like in the other links. I don't see anything dplyr-related here (which is what the question was about). Nor do I see how it is any better than `replace(df, df == 2, 10)` or `df[df==2] <- 10`. – David Arenburg Jul 16 '17 at 21:31
  • I pointed out quite clearly that it only requires magrittr. Also note that the `%<>%` solution only requires one reference to df vs. two for base. I think the intent of the question was how to do it in a pipeline. – G. Grothendieck Jul 16 '17 at 21:33
  • You could do `df = mutate_all(df, funs(replace(., .==2, 10)))` if you wanted to use a `dplyr` function, but as @G.Grothendieck points out, you don't need `mutate_all` in this case. – eipi10 Jul 16 '17 at 21:49
  • Yes, I was simply looking for a pipeable solution. I have edited the original post to make this clearer. – CPak Jul 16 '17 at 22:15
2

The OP's question is about how to replace values using dplyr, and it has been resolved thanks to G. Grothendieck. But I was curious how the performance differs between approaches based on dplyr, data.table, and base R, so I designed and conducted the following benchmark.

# Load packages
library(dplyr)
library(data.table)
library(microbenchmark)

# Create example data frame
df <- data.frame(A = 1:5, B = 2:6, C = -1:3)
# Convert to data.table
dt <- as.data.table(df)

# Method 1: Use mutate_all and ifelse
F1 = function(df){df %>% mutate_all(funs(ifelse(. == 2, 10, .)))}
# Method 2: Use mutate_all and replace
F2 = function(df){df %>% mutate_all(funs(replace(., . == 2, 10)))}
# Method 3: Use replace
F3 = function(df){df %>% replace(. == 2, 10)}
# Method 4: Base R data frame assignment
F4 = function(df){
  df[df == 2] <- 10
  return(df)
}

# Benchmarking
microbenchmark(
  M1 = F1(df),
  M2 = F2(df),
  M3 = F3(df),
  M4 = F4(df),
  # Same as M4, but use data.table object as input
  M5 = F4(dt)
)

Unit: microseconds
 expr      min         lq       mean     median         uq       max neval
   M1 8634.974 13028.7975 17224.4669 14907.3735 19496.5275 79750.182   100
   M2 8925.565 12626.2675 16698.7412 15551.7410 18658.1125 35468.760   100
   M3  282.252   391.6240   591.2534   553.5980   647.8965  3290.797   100
   M4  163.578   252.1025   423.7627   349.6080   420.8125  5415.382   100
   M5  228.367   333.2495   596.1735   440.3775   555.5230  7506.609   100 

The results show that mutate_all with ifelse (M1) or replace (M2) is much slower than the other approaches. Using replace with the pipe (M3) is fast, but still a little slower than base R (M4). Converting the data.frame to a data.table and then applying the assignment replacement (M5) is not faster than M4.

So, in this case, I think there is no special need to use dplyr functions, because they are not faster than the base R method (M4). There is also no need to convert the data.frame to a data.table. If a pipe operation is desirable, we can use the pipe with replace (M3), or we can define a function, such as F4, and put it in the pipeline.
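
As a further check (reusing the F1-F4 functions defined above; the data size below is an arbitrary assumption, and the exact timings will vary by machine), the same benchmark can be repeated on a larger data frame to see whether the ranking holds beyond the microsecond scale:

# Larger example input (illustrative size only)
set.seed(1)
big_df <- as.data.frame(matrix(sample(1:10, 1e6, replace = TRUE), ncol = 10))

# Same comparisons as above, on the larger data frame
microbenchmark(
  M1 = F1(big_df),
  M2 = F2(big_df),
  M3 = F3(big_df),
  M4 = F4(big_df),
  times = 10
)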

www
  • Thanks for the legwork. Good to know how they stack up in performance. In my experience, non-answer Answers are *controversial*, but I think it deserves a look. – CPak Jul 16 '17 at 22:19
  • idiom consistency usually trumps "microsecond performance" benefits in my extensive experience in the real world – hrbrmstr Jul 17 '17 at 01:07
  • @hrbrmstr I agree. Idiom consistency helps code readability. For small data frames, idiom consistency is more important than "microsecond performance". But if there are lots of data frames, or a large data frame, "microsecond performance" could be important. When I said "in this case", I was not referring only to this small data frame, but to this kind of operation in general. – www Jul 17 '17 at 02:27