1

I have a dataset with many columns but I will only mention the ones that are necessary for this operation and provide a provisional one (I also believe you do not need the ID information I just included it to make it easier to comprehend).

Business Division | Local Claim ID| CMB
       GC                123         **Y**       
       GC                124         N
       NAC               125         N
       NAC               126         N
       NAC               127         **Y**
       GC                128         N

and I would like to get rid of the CMB column while I would replace the CMB value for each business division if the original value is Y, basically I would want the table to look like following: (Business Divison now 3 classes)

Business Division | Local Claim ID
       **CMB**           123             
       GC                124        
       NAC               125        
       NAC               126        
       **CMB**           127        
       GC                128      

Here is the output of dput to reproduce my data:

structure(list(Business.Division = c("CMB", "GC", "NAC", "NAC", 
"CMB", "GC"), Local.Claim.ID = 123:128, CMB = c("Y", "N", 
"N", "N", "Y", "N")), .Names = c("Business.Division", "Local.Claim.ID", 
"CMB"), row.names = c(NA, -6L), class = "data.frame")
Steve
  • 2,401
  • 3
  • 24
  • 28
VKosik
  • 11
  • 3

4 Answers4

4

I would go with data.table if you want to both evaluate only the relevant rows and update in place

library(data.table)
setDT(df)[CMB == "Y", Business.Division := "CMB"][, CMB := NULL]
#    Business.Division Local.Claim.ID
# 1:               CMB            123
# 2:                GC            124
# 3:               NAC            125
# 4:               NAC            126
# 5:               CMB            127
# 6:                GC            128
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • `setDT(df)[CMB == "Y", Business.Division := "CMB"][, CMB := NULL] Error in any_na(as_list(RHS)) : object 'CanyNA' not found`; Am I missing a part of the package? – Pierre L Jun 08 '15 at 15:49
  • @plafort have no idea why you are getting this error. I've loaded the data without the `*` btw, as OP only used them to illustrate the desired values. Also, maybe update your `data.table` package. – David Arenburg Jun 08 '15 at 15:53
3

Try:

df %>% 
  mutate(Business.Division = replace(Business.Division, which(CMB == 'Y'), 'CMB')) %>% 
  select(-CMB)

Which gives:

#  Business.Division Local.Claim.ID
#1               CMB            123
#2                GC            124
#3               NAC            125
#4               NAC            126
#5               CMB            127
#6                GC            128

Benchmark

Updated to add suggestions on the benchmark:

df <- data.frame(Business.Division = sample(c("GC", "NAC"), 10e6, replace = TRUE),
                 Local.Claim.ID = sample(100:199, 10e6, replace = TRUE),
                 CMB = sample(c("Y", "N"), 10e6, replace = TRUE),
                 stringsAsFactors = FALSE)

library(microbenchmark)
mbm <- microbenchmark(
  me = mutate(df, Business.Division = replace(Business.Division,which(CMB == "Y"), "CMB")),
  stevensp = (df$Business.Division <- ifelse(df$CMB == "Y", "CMB", df$Business.Division)),
  mts = (df$Business.Division[which(df$CMB == "Y")] = "CMB"),
  david1 = setDT(df)[CMB == "Y", Business.Division := "CMB"],
  david2 = setkey(setDT(df), CMB)[.("Y"), Business.Division := "CMB"],
  times = 10
)

David's is much faster:

> mbm
Unit: milliseconds
     expr        min         lq       mean     median         uq        max neval cld
       me  496.79251  556.70752  592.35165  608.23875  634.88809  661.33805    10  b 
 stevensp 3449.53516 3518.47649 3585.91006 3572.62433 3681.19332 3718.06284    10   c
      mts  591.22479  654.01000  661.02210  661.41281  679.53060  719.74752    10  b 
   david1   58.67554   62.15468   66.85337   62.31426   62.99337   92.49148    10 a  
   david2   86.04280   89.42500  117.76540   89.61656   89.79652  232.45398    10 a  

enter image description here

Steven Beaupré
  • 21,343
  • 7
  • 57
  • 77
  • 1
    Or another option would be `replace` i.e. `df %>% mutate(Business.Division= replace(Business.Division, which(CMB=='**Y**'), '**CMB**')) %>% select(-CMB)` – akrun Jun 08 '15 at 15:30
  • 1
    @akrun Much better with replace, indeed. Will edit accordingly. – Steven Beaupré Jun 08 '15 at 16:54
  • 1
    And that's without pulling the heavy machinery (binary joins). Also, great proof that `ifelse` is pure ****. I wonder how `setkey(setDT(df), CBM)[.("Y"), Business.Division := "CMB"]` would scale. – David Arenburg Jun 08 '15 at 17:05
  • @DavidArenburg It's actually a bit slower. I'll update the benchmark. – Steven Beaupré Jun 08 '15 at 17:28
2

You can use ifelse alone (without any additional packages) for this purpose:

df$Business.Division = ifelse(df$CMB == "Y", "CMB", df$Business.Division)

As DavidArenburg points out in the comments below, this is not efficient for working with big data. But if your data isn't huge, this is a nice, simple way to go.

Steve
  • 2,401
  • 3
  • 24
  • 28
  • 1
    This code was already used by StevenBeaupre within `mutate`. I am guessing that you want to convey the message that `ifelse` can be used without any additional package? – akrun Jun 08 '15 at 15:36
  • @akrun Yes, exactly. This is a pretty basic task (conditional replacement of a variable value) -- and can be done just using base R. I edited my answer to make that more clear, though! – Steve Jun 08 '15 at 15:44
  • 1
    It's a basic task but very inefficient for a big data, see [here](http://stackoverflow.com/questions/16275149/does-ifelse-really-calculate-both-of-its-vectors-every-time-is-it-slow). @mts solution would be much more preferable if OP has `character` vector instead of a `factor` – David Arenburg Jun 08 '15 at 15:55
  • @DavidArenburg - true! – Steve Jun 08 '15 at 16:48
1

Or (only works if businessdivision isn't a factor)

df$businessdivision[which(df$CMB == "Y")] = "CMB"
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
mts
  • 2,160
  • 2
  • 24
  • 34