I want to replace <NA> values in a factor column with a valid value, but I cannot find a way to do it. This example is only for demonstration; the original data comes from a foreign CSV file I have to deal with.

df <- data.frame(a=sample(0:10, size=10, replace=TRUE),
                 b=sample(20:30, size=10, replace=TRUE))
df[df$a==0,'a'] <- NA
df$a <- as.factor(df$a)

The result could look like this:

      a  b
1     1 29
2     2 23
3     3 23
4     3 22
5     4 28
6  <NA> 24
7     2 21
8     4 25
9  <NA> 29
10    3 24

Now I want to replace the <NA> values with a number.

df[is.na(df$a), 'a'] <- 88
In `[<-.factor`(`*tmp*`, iseq, value = c(88, 88)) :
  invalid factor level, NA generated

I think I am missing a fundamental R concept about factors. Am I? I cannot understand why this doesn't work. I think "invalid factor level" means that 88 is not a valid level in that factor, right? So do I have to tell the factor column that there is another level?

zx8754
buhtz
  • I don't understand why you have the line of code df$a <- as.factor(df$a). Why do you want that column to be a factor? – DarrenRhodes Aug 24 '16 at 15:01
  • @buhtz: if no `0` happens to be sampled in the `data.frame` call, nobody will be able to replicate your problem; it may be better to call `set.seed()` first. – 000andy8484 Aug 24 '16 at 15:12
  • @000andy8484 Thanks for that hint. I will pin that to my notes for the next time. – buhtz Aug 24 '16 at 18:39
  • @user1945827 It is just to imitate my real data (coming from a foreign CSV file) and my real situation, while providing a minimal example. – buhtz Aug 24 '16 at 18:40
  • I would suggest that the factor is a red herring. When you import the data using read.csv(), set stringsAsFactors = FALSE and no factors will be created in the resulting data.frame. – DarrenRhodes Aug 25 '16 at 07:15
  • @user1945827 Awesome! Thanks. – buhtz Aug 25 '16 at 07:16
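To illustrate the stringsAsFactors suggestion from the comments, here is a minimal sketch. The CSV text and the replacement value "none" are made up for illustration. Note that since R 4.0.0, stringsAsFactors = FALSE is already the default for read.csv().

```r
# Hypothetical CSV content standing in for the foreign file
csv_text <- "a,b\nx,29\nNA,24\ny,21"

# Keep string columns as character vectors instead of factors
df_csv <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# A character column accepts any replacement value without complaint
df_csv$a[is.na(df_csv$a)] <- "none"

df_csv$a
```

If a factor is needed later, the cleaned column can still be converted with factor(df_csv$a).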

6 Answers

76

1) addNA. If fac is a factor, addNA(fac) is the same factor but with NA added as a level. See ?addNA.

To force the NA level to be 88:

facna <- addNA(fac)
levels(facna) <- c(levels(fac), 88)

giving:

> facna
 [1] 1  2  3  3  4  88 2  4  88 3 
Levels: 1 2 3 4 88

1a) This can be written in a single line as follows:

`levels<-`(addNA(fac), c(levels(fac), 88))

2) factor It can also be done in one line using the various arguments of factor like this:

factor(fac, levels = levels(addNA(fac)), labels = c(levels(fac), 88), exclude = NULL)

2a) or equivalently:

factor(fac, levels = c(levels(fac), NA), labels = c(levels(fac), 88), exclude = NULL)

3) ifelse Another approach is:

factor(ifelse(is.na(fac), 88, paste(fac)), levels = c(levels(fac), 88))

4) forcats The forcats package has a function for this:

library(forcats)

fct_na_value_to_level(fac, "88")
## [1] 1  2  3  3  4  88 2  4  88 3 
## Levels: 1 2 3 4 88

Note: We used the following for input fac

fac <- structure(c(1L, 2L, 3L, 3L, 4L, NA, 2L, 4L, NA, 3L), .Label = c("1", 
"2", "3", "4"), class = "factor")

Update: Have improved (1) and added (1a). Later added (4).

Captain Hat
G. Grothendieck
  • Hey :) I did 1a for a column in my data.frame. The level appears but if I want to calculate means for specific conditions, let say for all b in the above example that have the level NA I get NaN. I tried `mean(df$b[df$a==NA])` Also str(df) gives me: `Factor w/ 3 levels "1", "2", "3", NA:...` I think what I need is `"1", "2", "3", "NA"`right? – HonestRlover.not. Jun 22 '20 at 16:40
  • Option 3) worked for me and I could correctly apply it with a pipe. I tested with and without paste(fac) inside the ifelse statement and both worked fine for me. Any specific reason for why the paste needs to be included? – Sapiens Aug 05 '21 at 12:08
  • So that the factor is rebuilt from scratch. – G. Grothendieck Aug 05 '21 at 12:22
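As a sketch of why paste() matters in (3): when ifelse() is handed a factor directly, it returns the underlying integer codes rather than the labels. With the input fac above the codes happen to equal the labels, so both variants look the same. The factor f below is a made-up example where they differ.

```r
# Hypothetical factor whose labels differ from its integer codes
f <- factor(c("3", "5", NA, "8"))

# Without paste(): ifelse() drops the factor class and returns the codes
codes <- ifelse(is.na(f), 88, f)
codes
# [1]  1  2 88  3

# With paste(): the labels survive as character strings
lab <- ifelse(is.na(f), 88, paste(f))
lab
# [1] "3"  "5"  "88" "8"

# Rebuild the factor from the labels, with 88 as an extra level
rebuilt <- factor(lab, levels = c(levels(f), 88))
```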
9

I had similar issues and I want to add what I consider the most pragmatic (and also tidy) solution:

Convert the column to a character column, use mutate and a simple ifelse-statement to change the NA values to what you want the factor level to be (I have chosen "None"), convert it back to a factor column:

library(dplyr)

df %>% mutate(
  a = as.character(a),               # drop the factor class
  a = ifelse(is.na(a), "None", a),   # replace NA with the new label
  a = as.factor(a)                   # rebuild the factor
)

Clean and painless because you do not actually have to dabble with NA values when they occur in a factor column. You bypass the weirdness and end up with a clean factor variable.

Also, in response to the comment made below regarding multiple columns: You can wrap the statements in a function and use mutate_if to select all factor variables or, if you know the names of the columns of concern, mutate_at to apply that function:

replace_factor_na <- function(x){
  x <- as.character(x)
  x <- if_else(is.na(x), "None", x)
  as.factor(x)  # return the rebuilt factor visibly
}

df <- df %>%
  mutate_if(is.factor, replace_factor_na)
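In current dplyr releases mutate_if() and mutate_at() are superseded by across(). A minimal sketch of the same idea follows, assuming dplyr 1.0.0 or later; the small data frame and the value argument are made up for illustration.

```r
library(dplyr)

# Hypothetical data with NA in a factor column
df_demo <- data.frame(a = factor(c("1", NA, "3")), b = c(20, 21, 22))

# Same idea as above, with the replacement label as a parameter
replace_factor_na <- function(x, value = "None") {
  x <- as.character(x)
  x[is.na(x)] <- value
  factor(x)
}

# Apply the helper to every factor column
df_clean <- df_demo %>%
  mutate(across(where(is.factor), replace_factor_na))

levels(df_clean$a)
```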
8

Another way to do it is:

#check levels
levels(df$a)
#[1] "3"  "4"  "7"  "9"  "10"

#add a new factor level (88 in our example)
df$a = factor(df$a, levels=c(levels(df$a), 88))

#convert all NA's to 88
df$a[is.na(df$a)] = 88

#check levels again
levels(df$a)
#[1] "3"  "4"  "7"  "9"  "10" "88"
Karim Kanatov
6

My approach is a bit more traditional, using the factor function:

a <- factor(a, 
            exclude = NULL, 
            levels = c(levels(a), NA),
            labels = c(levels(a), "None"))

You can replace "None" with whatever replacement you want (0L, for example).

Bảo Trần
5

The basic concept of a factor variable is that it can only take specific values, i.e., the levels. A value not in the levels is invalid.

You have two possibilities:

If you have a variable that follows this concept, make sure to define all levels when you create it, even those without corresponding values.

Or make the variable a character variable and work with that.

PS: Often these problems result from data import. For instance, what you show there looks like it should be a numeric variable and not a factor variable.
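As a minimal sketch of the first possibility, the levels can be declared up front, including one (88 here) that no value uses yet. The input vector is made up for illustration.

```r
# Hypothetical raw values with one missing entry
a <- c(1, 2, NA, 3)

# Declare every allowed level at creation time, including the unused 88
f <- factor(a, levels = c(1, 2, 3, 88))

# 88 is a known level, so this assignment raises no warning
f[is.na(f)] <- 88

f
# [1] 1  2  88 3
# Levels: 1 2 3 88
```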

Roland
  • It is hard to decide where to put the green mark here! ;) Your answer provided me the background info about the basic concept I missed before. Thank you very much. – buhtz Aug 24 '16 at 18:44
4

The problem is that 88 is not one of the levels of that factor:

> levels(df$a)
[1] "2"  "4"  "5"  "9"  "10"

You can't change it straight away, but the following will do the trick:

df$a <- as.numeric(as.character(df$a))
df[is.na(df$a),1] <- 88
df$a <- as.factor(df$a)
> df$a
 [1] 9  88 3  9  5  9  88 8  3  9 
Levels: 3 5 8 9 88
> levels(df$a)
[1] "3"  "5"  "8"  "9"  "88"
000andy8484
  • `df$a <- as.numeric(levels(df$a))[df$a]` is a slightly more efficient variant for `as.numeric(as.character())`. – 000andy8484 Aug 24 '16 at 15:16
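A small sketch of why the conversion goes through the labels: calling as.numeric() on a factor directly returns its integer codes, not the values it displays. The factor below is a made-up example.

```r
# Hypothetical factor; its levels are sorted as "3", "5", "9"
f <- factor(c("9", "3", "5"))

as.numeric(f)                # integer codes, not the values
# [1] 3 1 2

as.numeric(as.character(f))  # values, via the labels
# [1] 9 3 5

as.numeric(levels(f))[f]     # same result, converting each level only once
# [1] 9 3 5
```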