
The title is a little garbled, but I'm not sure how else to describe it. I'm coming from Stata so still getting the hang of factors.

Basically, I want to be able to assign factor levels and labels, and have any values that I miss assigned to a default level/label.

Take the following:

library(dplyr)
dt <- as.data.frame(mtcars)  # load demo data
dt$carb[4:6] <- NA           # set some rows to NA for example

dt <- dt%>%
  mutate(
    carb_f = factor(carb,
                    levels = c(1,2,3,4), 
                    labels = c("One","Two","Three","Four")
                    )
  )

table(dt$carb, dt$carb_f, exclude=NULL)

which yields the following:

       One Two Three Four <NA>
  1      5   0     0    0    0
  2      0   9     0    0    0
  3      0   0     3    0    0
  4      0   0     0   10    0
  6      0   0     0    0    1
  8      0   0     0    0    1
  <NA>   0   0     0    0    3

The unstated 6 and 8 are set to NA in the resultant factor carb_f. Although this is expected behaviour, I want to be able to request something like this:

dt <- dt%>%
  mutate(
    carb_f = factor(carb,
                    levels = c(1,2,3,4), 
                    labels = c("One","Two","Three","Four"),
                    non-na(10,"Unk")   # obvious pseudocode
                    )
  )

to yield this:

       One Two Three Four Unk <NA>
  1      5   0     0    0   0    0
  2      0   9     0    0   0    0
  3      0   0     3    0   0    0
  4      0   0     0   10   0    0
  6      0   0     0    0   1    0
  8      0   0     0    0   1    0
  <NA>   0   0     0    0   0    3

...where the unstated 6 and 8 are assigned to a default level/label of 10 and Unk, but the true NAs remain NA.

Is there a way of handling this without explicitly referencing 6 and 8 ?

  • Related: [R: factor levels, recode rest to 'other'](https://stackoverflow.com/questions/15533594/r-factor-levels-recode-rest-to-other); [Cleaning up factor levels (collapsing multiple levels/labels)](https://stackoverflow.com/questions/19410108/cleaning-up-factor-levels-collapsing-multiple-levels-labels) – Henrik Sep 09 '22 at 09:08

1 Answer


Just use the same label multiple times.

dt <- transform(dt, carb_f=factor(carb, labels=c('one', 'two', 'three', 'four', 'unk', 'unk')))
table(dt$carb, dt$carb_f, useNA='ifany')
#      one two three four unk <NA>
# 1      5   0     0    0   0    0
# 2      0   9     0    0   0    0
# 3      0   0     3    0   0    0
# 4      0   0     0   10   0    0
# 6      0   0     0    0   1    0
# 8      0   0     0    0   1    0
# <NA>   0   0     0    0   0    3

Note: I omitted the levels= argument since the default ordering (the sorted unique values of carb) is sufficient. However, it can be very helpful if we want a different order, e.g. levels=c(2, 1, 3, 4, 6, 8) to use 2 as the first (hence reference) level. Note further that levels and labels correspond by position.
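As a small illustration of that positional correspondence, here is a sketch on a toy vector (the vector `carb` below is made up for the example and is not the mtcars column). It assumes R 3.4.0 or later, where duplicated labels are merged into one level.

```r
# Toy data for illustration only (not the mtcars column)
carb <- c(1, 2, 3, 4, 6, 8, NA)

# levels and labels are paired by position: 2 -> 'two', 1 -> 'one', ...
# Both 6 and 8 are paired with 'unk', so they end up in the same level.
carb_f <- factor(carb,
                 levels = c(2, 1, 3, 4, 6, 8),
                 labels = c('two', 'one', 'three', 'four', 'unk', 'unk'))

levels(carb_f)  # 'two' is listed first, so it acts as the reference level
carb_f          # the NA in the input stays NA
```

Because 'two' is now the first level, model functions that use treatment contrasts would take it as the baseline.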

To avoid typing the label multiple times, first combine the respective values into a single new value that is higher than all others, e.g. Inf, and then call factor in a second step. This can easily be done using within.

dt <- within(dt, {
  carb_f <- ifelse(carb %in% c(6, 8), Inf, carb)
  carb_f <- factor(carb_f, labels=c('one', 'two', 'three', 'four', 'unk'))
})

table(dt$carb, dt$carb_f, useNA='ifany')
#      one two three four unk <NA>
# 1      5   0     0    0   0    0
# 2      0   9     0    0   0    0
# 3      0   0     3    0   0    0
# 4      0   0     0   10   0    0
# 6      0   0     0    0   1    0
# 8      0   0     0    0   1    0
# <NA>   0   0     0    0   0    3
jay.sf
  • Thanks for this. Makes sense, as does your note. I was hoping that there might be a more programmatic way of doing this where, for example, if there were another 12 unstated levels they would all be collapsed into "Unk" without me having to type "Unk" 12 times. It may be that I'm coming at it from the wrong direction and that a pre-factor recode is a necessary step. – Warren Holroyd Sep 09 '22 at 08:46
  • @WarrenHolroyd Programmatically combining "unk"-levels beforehand also is straightforward, please see my updated answer. – jay.sf Sep 09 '22 at 08:57
  • Thanks @jay.sf. This gets me to where I need to be. My actual data has no point-of-capture validation. So, in this analogy, 6 and 8 may not be the only "invalid" values that I wish to collapse in the carb column. However, based on your suggestion, I can negate the %in% statement against a list of "valid" codes (in this example, 1 thru 4) so that I never explicitly refer to 6 or 8 (or any other out-of-range value). I now have a working process. Thank you for your assistance. Marked as the answer. – Warren Holroyd Sep 09 '22 at 23:06
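The approach described in the last comment (negating %in% against a list of valid codes) might look roughly like the sketch below. The name `valid` is an assumption introduced for the example. One thing to watch: `NA %in% valid` is FALSE, so a bare `!(carb %in% valid)` would also catch the true NAs, which is why the sketch adds an explicit `!is.na(carb)` check.

```r
dt <- as.data.frame(mtcars)  # same demo data as in the question
dt$carb[4:6] <- NA           # set some rows to NA for example

valid <- 1:4                 # hypothetical list of valid codes

dt <- within(dt, {
  # Anything that is neither missing nor a valid code is collapsed to Inf;
  # true NAs fall through to the 'else' branch and stay NA.
  carb_f <- ifelse(!is.na(carb) & !(carb %in% valid), Inf, carb)
  carb_f <- factor(carb_f,
                   levels = c(valid, Inf),
                   labels = c('One', 'Two', 'Three', 'Four', 'Unk'))
})

table(dt$carb, dt$carb_f, useNA = 'ifany')
```

Stating levels= explicitly here keeps the labels aligned even if one of the valid codes happens not to occur in the data, which the shorter labels-only call would not guarantee.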