Conditional replacement of character variable in R

Question

I have a dataset with many columns but I will only mention the ones that are necessary for this operation and provide a provisional one (I also believe you do not need the ID information I just included it to make it easier to comprehend).

Business Division | Local Claim ID| CMB
       GC                123         **Y**       
       GC                124         N
       NAC               125         N
       NAC               126         N
       NAC               127         **Y**
       GC                128         N

and I would like to get rid of the CMB column while I would replace the CMB value for each business division if the original value is Y, basically I would want the table to look like following: (Business Divison now 3 classes)

Business Division | Local Claim ID
       **CMB**           123             
       GC                124        
       NAC               125        
       NAC               126        
       **CMB**           127        
       GC                128

Here is the output of dput to reproduce my data:

structure(list(Business.Division = c("CMB", "GC", "NAC", "NAC", 
"CMB", "GC"), Local.Claim.ID = 123:128, CMB = c("Y", "N", 
"N", "N", "Y", "N")), .Names = c("Business.Division", "Local.Claim.ID", 
"CMB"), row.names = c(NA, -6L), class = "data.frame")

score 4 · Answer 1 · answered Jun 08 '15 at 15:23

4

I would go with data.table if you want to both evaluate only the relevant rows and update in place

library(data.table)
setDT(df)[CMB == "Y", Business.Division := "CMB"][, CMB := NULL]
#    Business.Division Local.Claim.ID
# 1:               CMB            123
# 2:                GC            124
# 3:               NAC            125
# 4:               NAC            126
# 5:               CMB            127
# 6:                GC            128

answered Jun 08 '15 at 15:23

David Arenburg

91,361
17
137
196

`setDT(df)[CMB == "Y", Business.Division := "CMB"][, CMB := NULL] Error in any_na(as_list(RHS)) : object 'CanyNA' not found`; Am I missing a part of the package? – Pierre L Jun 08 '15 at 15:49
@plafort have no idea why you are getting this error. I've loaded the data without the `*` btw, as OP only used them to illustrate the desired values. Also, maybe update your `data.table` package. – David Arenburg Jun 08 '15 at 15:53

Steven Beaupré · Answer 2 · 2015-06-08T17:30:02.730

Try:

df %>% 
  mutate(Business.Division = replace(Business.Division, which(CMB == 'Y'), 'CMB')) %>% 
  select(-CMB)

Which gives:

#  Business.Division Local.Claim.ID
#1               CMB            123
#2                GC            124
#3               NAC            125
#4               NAC            126
#5               CMB            127
#6                GC            128

Benchmark

Updated to add suggestions on the benchmark:

df <- data.frame(Business.Division = sample(c("GC", "NAC"), 10e6, replace = TRUE),
                 Local.Claim.ID = sample(100:199, 10e6, replace = TRUE),
                 CMB = sample(c("Y", "N"), 10e6, replace = TRUE),
                 stringsAsFactors = FALSE)

library(microbenchmark)
mbm <- microbenchmark(
  me = mutate(df, Business.Division = replace(Business.Division,which(CMB == "Y"), "CMB")),
  stevensp = (df$Business.Division <- ifelse(df$CMB == "Y", "CMB", df$Business.Division)),
  mts = (df$Business.Division[which(df$CMB == "Y")] = "CMB"),
  david1 = setDT(df)[CMB == "Y", Business.Division := "CMB"],
  david2 = setkey(setDT(df), CMB)[.("Y"), Business.Division := "CMB"],
  times = 10
)

David's is much faster:

> mbm
Unit: milliseconds
     expr        min         lq       mean     median         uq        max neval cld
       me  496.79251  556.70752  592.35165  608.23875  634.88809  661.33805    10  b 
 stevensp 3449.53516 3518.47649 3585.91006 3572.62433 3681.19332 3718.06284    10   c
      mts  591.22479  654.01000  661.02210  661.41281  679.53060  719.74752    10  b 
   david1   58.67554   62.15468   66.85337   62.31426   62.99337   92.49148    10 a  
   david2   86.04280   89.42500  117.76540   89.61656   89.79652  232.45398    10 a

enter image description here

Or another option would be `replace` i.e. `df %>% mutate(Business.Division= replace(Business.Division, which(CMB=='**Y**'), '**CMB**')) %>% select(-CMB)` — akrun, Jun 08 '15 at 15:30
@akrun Much better with replace, indeed. Will edit accordingly. — Steven Beaupré, Jun 08 '15 at 16:54
And that's without pulling the heavy machinery (binary joins). Also, great proof that `ifelse` is pure ****. I wonder how `setkey(setDT(df), CBM)[.("Y"), Business.Division := "CMB"]` would scale. — David Arenburg, Jun 08 '15 at 17:05
@DavidArenburg It's actually a bit slower. I'll update the benchmark. — Steven Beaupré, Jun 08 '15 at 17:28

Steve · Answer 3 · 2015-06-09T19:26:53.163

2

You can use ifelse alone (without any additional packages) for this purpose:

df$Business.Division = ifelse(df$CMB == "Y", "CMB", df$Business.Division)

As DavidArenburg points out in the comments below, this is not efficient for working with big data. But if your data isn't huge, this is a nice, simple way to go.

edited Jun 09 '15 at 19:26

answered Jun 08 '15 at 15:27

Steve

2,401
3
24
28

1

This code was already used by StevenBeaupre within `mutate`. I am guessing that you want to convey the message that `ifelse` can be used without any additional package? – akrun Jun 08 '15 at 15:36
@akrun Yes, exactly. This is a pretty basic task (conditional replacement of a variable value) -- and can be done just using base R. I edited my answer to make that more clear, though! – Steve Jun 08 '15 at 15:44
1

It's a basic task but very inefficient for a big data, see [here](http://stackoverflow.com/questions/16275149/does-ifelse-really-calculate-both-of-its-vectors-every-time-is-it-slow). @mts solution would be much more preferable if OP has `character` vector instead of a `factor` – David Arenburg Jun 08 '15 at 15:55
@DavidArenburg - true! – Steve Jun 08 '15 at 16:48

score 1 · Answer 4 · edited Jun 09 '15 at 06:38

1

Or (only works if businessdivision isn't a factor)

df$businessdivision[which(df$CMB == "Y")] = "CMB"

edited Jun 09 '15 at 06:38

David Arenburg

91,361
17
137
196

answered Jun 08 '15 at 15:23

mts

2,160
2
24
34

This will only work if `businessdivision` isn't a factor – David Arenburg Jun 08 '15 at 15:25
true, I'm assuming a data.frame – mts Jun 08 '15 at 15:28
2

We all assuming a `data.frame`. What else would we assume? – David Arenburg Jun 08 '15 at 15:28

Conditional replacement of character variable in R

4 Answers4