
I'm attempting to change values of a variable into NA values if they're not in a vector:

sample <- factor(c('01', '014', '1', '14', '24'))
df <- data.frame(var1 = 1:6, var2 = factor(c('01', '24', 'none', '1', 'unknown', '24')))
df$var2 <- ifelse(df$var2 %in% sample, df$var2, NA)

For some reason R does not preserve the original values of the factor variable but turns them into a numeric sequence:

> sample <- factor(c('01', '014', '1', '14', '24'))
> df <- data.frame(var1 = 1:6, 
                   var2 = factor(c('01', '24', 'none', '1', 'unknown', '24')))
> class(df$var2)
[1] "factor"
> df
  var1    var2
1    1      01
2    2      24
3    3    none
4    4       1
5    5 unknown
6    6      24
> df$var2 <- ifelse(df$var2 %in% sample, df$var2, NA)
> class(df$var2)
[1] "integer"
> df
  var1 var2
1    1    1
2    2    3
3    3   NA
4    4    2
5    5   NA
6    6    3

Why does this happen and what would be the correct way of achieving what I'm trying to here?

(I need to use factors rather than integers so that "01" and "1" are not confused. My original data set is large, so using factors rather than characters should save me some memory.)

lillemets
  • Try `dplyr::if_else`. – tchakravarty Nov 10 '16 at 09:00
  • Have you tried simply adding `as.factor()` around your `ifelse()` function ? Like so : `df$var2 <- as.factor(ifelse(df$var2 %in% sample, df$var2, NA))` – Pierre Chevallier Nov 10 '16 at 09:04
  • By default in R, when a vector of 3 elements has 2 characters and 1 numeric, the numeric is converted to character. In your example, the values in "sample", though characters, all contain numeric values, and so internally ifelse returned them as numeric. If you want it to be character then use as.character(): ifelse(var2 %in% sample, as.character(var2), NA) – joel.wilson Nov 10 '16 at 09:05
  • Some statements in the Q require clarification: (1) R stores `character` as efficiently as `factor`. I avoid `factor` unless absolutely needed. (2) `factor` _levels_ are stored as `integer`, so no surprise. (3) Please read the _Warning_ section in `help("ifelse")`: _The mode of the result may depend on the value of test, and the class attribute of the result is taken from test and may be inappropriate for the values selected from yes and no._ See also the suggestions there to avoid mishaps. – Uwe Nov 10 '16 at 20:10

1 Answer


I think one way to achieve what you are trying to do is to change the levels of your factor:

levels(df$var2)[!levels(df$var2) %in% sample] <- NA

Setting a level to NA turns every value that had that level into NA, so the result is:

df
  var1 var2
1    1   01
2    2   24
3    3 <NA>
4    4    1
5    5 <NA>
6    6   24

> df$var2
[1] 01   24   <NA> 1    <NA> 24  
Levels: 01 1 24

The unknown and none values are no longer among the factor levels. If you would rather keep unknown and none as levels, assign NA to the non-matching values instead:

df$var2[!df$var2 %in% sample] <- NA

> df
  var1 var2
1    1   01
2    2   24
3    3 <NA>
4    4    1
5    5 <NA>
6    6   24


> df$var2
[1] 01   24   <NA> 1    <NA> 24  
Levels: 01 1 24 none unknown

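ifelse changes the class of your data because it does not preserve the class or other attributes of its yes and no arguments: the result is built from the test vector, so only the integer codes underlying the factor are kept. See the second answer here: How to prevent ifelse() from turning Date objects into numeric objects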
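To make the point about ifelse concrete, here is a minimal sketch using the data from the question. The names keep, codes and fixed are only illustrative, and the as.character() workaround is the one suggested in the comments, wrapped in factor() to get a factor back.

```r
# Same data as in the question
sample <- factor(c('01', '014', '1', '14', '24'))
df <- data.frame(var1 = 1:6,
                 var2 = factor(c('01', '24', 'none', '1', 'unknown', '24')))

keep <- df$var2 %in% sample

# ifelse() fills a copy of the logical test vector with values from `yes`,
# so the factor attributes are lost and only the integer codes remain
codes <- ifelse(keep, df$var2, NA)
class(codes)   # "integer"

# Passing the labels as character avoids the codes; factor() then
# rebuilds a factor from the labels that are left
fixed <- factor(ifelse(keep, as.character(df$var2), NA))
levels(fixed)  # "01" "1" "24"
```

Note that this route goes through a character vector, so on a large data set the two approaches above, which modify the factor in place, are probably the cheaper option.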

Finally, as @tchakravarty mentioned in the comments, you can use if_else from dplyr.
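A minimal sketch of the if_else route, assuming the dplyr package is installed. The helper name na_value is hypothetical; it supplies the missing value as a factor with the same levels, because if_else is strict about the types of its true and false arguments and older dplyr versions may reject a bare NA here.

```r
# Assumes dplyr is installed; same data as in the question
sample <- factor(c('01', '014', '1', '14', '24'))
df <- data.frame(var1 = 1:6,
                 var2 = factor(c('01', '24', 'none', '1', 'unknown', '24')))

# A missing value typed as a factor with the same levels as var2
# (na_value is an illustrative name, not part of dplyr)
na_value <- factor(NA, levels = levels(df$var2))

# Unlike base ifelse(), if_else() keeps the type of its inputs
df$var2 <- dplyr::if_else(df$var2 %in% sample, df$var2, na_value)
class(df$var2)  # "factor"
```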

User2321