I wanted to modify the levels of a factor variable by collapsing several levels into one when I came across this strange situation. My new level is created, but all the remaining values appear to be shifted up to the next level. Here is my example data, the code I used, and the output.

library(tidyverse) 
data <- structure(list(factor1 = structure(c(1L, 1L, 2L, 3L, 1L, 2L, 
        1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L, 3L, 1L, 1L, 1L, 4L), .Label = c("0", "1", "2", "3", 
        "4", "5", "6", "7"), class = "factor")), row.names = c(NA, -30L
        ), class = c("tbl_df", "tbl", "data.frame"), .Names = "factor1")
data_out <- data %>% mutate(factor1 = ifelse(factor1 %in% c('0', '1'), 
                                             factor1, '>1'))
structure(list(factor1 = c("1", "1", "2", ">1", "1", "2", "1", 
"1", "2", "2", "2", "2", "2", "1", "2", "1", "1", "1", "1", "1", 
"1", "1", "1", "1", "1", ">1", "1", "1", "1", ">1")), .Names = "factor1", 
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -30L))

Is this the intended behaviour? It certainly isn't what I want. How can it be explained, and how can I correct it?

divibisan
jakes
  • Let me know if you want it reopened for the explanation part, but I guess it boils down to `ifelse` doing less than you expected. You can type `ifelse` at the command line to see its code and run through it. – Frank Mar 14 '18 at 18:45
  • @Frank: Boy, those answers surely did not explain this behavior to me. – IRTFM Mar 14 '18 at 18:47
  • @42- Sure, fair enough. It's really a two-part question, and those answer the "how to do it" part, not the "what particular way does `ifelse` fail me here?" part. (For OP's reference, the link we're talking about shows up in the sidebar lower down under "Linked") – Frank Mar 14 '18 at 18:49
  • @jakes: I have no idea which link Frank is talking about. – IRTFM Mar 14 '18 at 18:55
  • @42-, well, neither do I – jakes Mar 14 '18 at 19:01
  • @42- If either of you wanted to know, you could have used @ to reach me. I meant the link that I used to close as a dupe (what 42- meant when referring to "those answers") -- it appears in the right sidebar even after the question is reopened. – Frank Mar 15 '18 at 15:09

2 Answers

I'm guessing this problem revolves around the way factors are constructed. How a factor goes from having levels of {"0", "1"} to levels {"1","2", ">1"} by way of mutate was still not clear to me.

R factors are actually integer vectors (with codes starting at 1) that carry their levels as an attribute. So your "0" values were stored as integer 1 and your "1" values as integer 2. Apparently the mutate call produced a new column with an additional value printed as ">1", but also turned the "0" values into "1" and the "1" values into "2". This looks like dangerous behavior on the part of mutate to me. I think it should have given you either a new factor with levels "0", "1", ">1" or it should have thrown an error.
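To make the point about integer codes concrete, here is a minimal sketch using a small made-up vector `x` (not the question's data). It only uses base R:

```r
# Hypothetical example vector, not the question's data
x <- factor(c("0", "1", "2", "0"))

labels <- levels(x)      # the labels: "0" "1" "2"
codes  <- as.integer(x)  # the underlying codes: 1 2 3 1

# Going through character recovers the labels instead of the codes
chars <- as.character(x) # "0" "1" "2" "0"

print(codes)
print(chars)
```

Anything that drops the factor's attributes will therefore see 1, 2, 3 rather than "0", "1", "2".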

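The error comes from ifelse rather than mutate. ifelse drops the factor's attributes, so the "yes" branch contributes the underlying integer codes, which are then coerced to character once '>1' is assigned. If you coerce data to a plain data frame and call ifelse directly, you see: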

data$factor2 <- ifelse( data$factor1 %in% c('0', '1'), 
                                              data$factor1, '>1')
data
#-------- same issue except
   factor1 factor2
1        0       1
2        0       1
3        1       2
4        2      >1
.... delete the other 26 rows
> str(data)
'data.frame':   30 obs. of  2 variables:
 $ factor1: Factor w/ 8 levels "0","1","2","3",..: 1 1 2 3 1 2 1 1 2 2 ...
 $ factor2: chr  "1" "1" "2" ">1" ...
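The mechanism can be reproduced by hand. The following is a simplified sketch of the assignments `ifelse` performs (the real function has extra handling for NAs and recycling), using a hypothetical three-element factor `x`. It also shows one possible fix, converting the factor to character before calling `ifelse`:

```r
# Hypothetical example, not the question's data
x <- factor(c("0", "1", "2"))
test <- x %in% c("0", "1")        # TRUE TRUE FALSE

# Simplified version of what ifelse(test, x, ">1") does
ans <- test                       # result starts out as a logical vector
ans[test] <- x[test]              # factor assigned in: only the integer codes survive
ans[!test] <- ">1"                # character assigned in: everything becomes character
print(ans)                        # "1" "2" ">1"

# One way around it: hand ifelse the labels, then rebuild the factor
fixed <- ifelse(test, as.character(x), ">1")
fixed_factor <- factor(fixed, levels = c("0", "1", ">1"))
print(fixed)                      # "0" "1" ">1"
```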

This would have let you stay in the dplyr package:

recode_factor(data$factor1, `0` = "0", `1` = "1", .default=">1")
 [1] 0  0  1  >1 0  1  0  0  1  1  1  1  1  0  1  0  0  0  0  0  0  0  0  0  0  >1 0  0  0  >1
Levels: 0 1 >1
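If you would rather not depend on any package, a base R alternative is to rename the levels directly: assigning the same label to several levels merges them. This is a sketch on a hypothetical factor `f` that mimics the structure of `factor1` (eight levels, only some of them used):

```r
# Hypothetical factor shaped like factor1: levels "0" to "7"
f <- factor(c("0", "1", "2", "3", "0"), levels = as.character(0:7))

g <- f
# Every level other than "0" and "1" gets the same new label, which merges them
levels(g)[!levels(g) %in% c("0", "1")] <- ">1"

print(levels(g))        # "0" "1" ">1"
print(as.character(g))  # "0" "1" ">1" ">1" "0"
```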
IRTFM
  • Clear explanation, regained faith in my intuition. Many thanks! – jakes Mar 14 '18 at 19:10
  • I was working on a different answer using `recode_factor`. Good to see its usages. – MKR Mar 14 '18 at 23:20
  • I get that you were pissed about my closing the question (even though I immediately commented to offer to reopen if the OP was unsatisfied), but your opening sentence is "noise" and belongs in comments or chat if you really care about what I'm up to so much. Re staying in the dplyr package, there's also a forcats package that is maybe useful here (? ... not sure, haven't used it myself, but it also appears in the aforementioned link). – Frank Mar 15 '18 at 15:12
  • @Frank: I sincerely did hope you were writing an answer. I thought you might understand the mechanism of the `ifelse` substitution better than I. – IRTFM Mar 15 '18 at 15:47
  • Ok; my misunderstanding there. Anyway, running `ifelse` line by line with example `x <- factor(c("0","1","2")); ifelse(x %in% c("0","1"), x, ">1")`, it becomes `ans <- test <- x %in% c("0","1"); ans[test] <- x[test]; ans[!test] <- ">1"` with the last two steps coercing from logical to int, then int to character. So I mostly blame `ifelse` rather than mutate. Btw, besides forcats, I think hadley also defined his own `if_else` that might be safer. – Frank Mar 15 '18 at 15:59

In case someone struggles with a similar issue in the future and is looking for an easy way to group these levels without the remaining ones being reassigned:

fct_collapse(data$factor1, '>1' = c('2', '3')) 
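Note that the call above only collapses the levels it names, so the unused levels "4" to "7" stay in the factor. A sketch of two ways to fold everything except "0" and "1" into ">1", assuming the forcats package is installed and using a hypothetical factor `f` shaped like `factor1`:

```r
library(forcats)

# Hypothetical factor shaped like factor1: levels "0" to "7"
f <- factor(c("0", "1", "2", "3", "0"), levels = as.character(0:7))

# Option 1: name every level that should be collapsed
h1 <- fct_collapse(f, ">1" = as.character(2:7))

# Option 2: keep "0" and "1", lump everything else into one level
h2 <- fct_other(f, keep = c("0", "1"), other_level = ">1")

print(levels(h1))  # "0" "1" ">1"
print(levels(h2))  # "0" "1" ">1"
```

The second form avoids listing the collapsed levels explicitly, which may be convenient when the set of levels is not known in advance.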
jakes