
Factors are one of the basic data types in R. In my experience, they are basically a pain and I never use them; I always convert to characters. I feel oddly like I'm missing something.

Are there some important examples of functions that use factors as grouping variables where the factor data type becomes necessary? Are there specific circumstances when I should be using factors?

JD Long
    I'm adding this comment for beginner R users who are likely to find this question. I recently wrote a blog post that compiles much of the information from the answers below into an instructional tutorial on when, how and why to use factors in R. http://gormanalysis.com/?p=115 – Ben Jul 21 '14 at 01:22
  • I had always assumed factors were stored more efficiently than characters—as if each entry were a pointer to the level. But on testing it to write this up, I found out that’s not true! – isomorphismes Apr 15 '15 at 13:39
    @isomorphismes well, that _used_ to be true, in the earlier days of R, but that has changed. See this blog post: http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/ – MichaelChirico Jan 24 '17 at 19:12
    5+ years later this "stringsAsFactors: An unauthorized biography" was written: http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/ – JD Long Feb 10 '17 at 10:42

8 Answers


You should use factors. Yes, they can be a pain, but my theory is that 90% of why they're a pain is that in read.table and read.csv the argument stringsAsFactors = TRUE by default, and most users miss this subtlety (a short read.csv sketch follows the example below). I say they are useful because model-fitting packages like lme4 use factors and ordered factors to fit models differently and to determine the type of contrasts to use. Graphing packages also use them to group by. ggplot and most model-fitting functions coerce character vectors to factors, so the result is the same; however, you end up with warnings in your code:

lm(Petal.Length ~ -1 + Species, data=iris)

# Call:
# lm(formula = Petal.Length ~ -1 + Species, data = iris)

# Coefficients:
#     Speciessetosa  Speciesversicolor   Speciesvirginica  
#             1.462              4.260              5.552  

iris.alt <- iris
iris.alt$Species <- as.character(iris.alt$Species)
lm(Petal.Length ~ -1 + Species, data=iris.alt)

# Call:
# lm(formula = Petal.Length ~ -1 + Species, data = iris.alt)

# Coefficients:
#     Speciessetosa  Speciesversicolor   Speciesvirginica  
#             1.462              4.260              5.552  

# Warning message:
# In model.matrix.default(mt, mf, contrasts) :
#   variable Species converted to a factor
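
As an aside on that read.csv default: you can opt out of the automatic conversion explicitly and convert only the columns you actually want as factors. A minimal sketch (the file and column names here are hypothetical, and note the default itself has changed across R versions):

# hypothetical file and column, shown only to illustrate the arguments
dat <- read.csv("my_data.csv", stringsAsFactors = FALSE)
dat$group <- factor(dat$group)   # opt in per column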

One tricky thing is the whole drop=TRUE bit. In vectors this works well to remove levels of factors that aren't in the data. For example:

s <- iris$Species
s[s == 'setosa', drop=TRUE]
#  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa
s[s == 'setosa', drop=FALSE]
#  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica

However, with data.frames, the behavior of [.data.frame() is different (see ?"[.data.frame"). Using drop=TRUE on data.frames does not work as you'd imagine:

x <- subset(iris, Species == 'setosa', drop=TRUE)  # subsetting with [ behaves the same way
x$Species
#  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica

Luckily, you can drop unused factor levels easily with droplevels(), either for an individual factor or for every factor in a data.frame (since R 2.12):

x <- subset(iris, Species == 'setosa')
levels(x$Species)
# [1] "setosa"     "versicolor" "virginica" 
x <- droplevels(x)
levels(x$Species)
# [1] "setosa"

This is how to keep the levels you've filtered out from showing up in ggplot legends.
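
For example, a minimal sketch (exact legend behaviour depends on your ggplot2 version and scale settings):

library(ggplot2)

setosa_only <- droplevels(subset(iris, Species == 'setosa'))

# Only "setosa" can appear in the legend, because the other levels no
# longer exist on the factor, whatever the scale's drop setting is.
ggplot(setosa_only, aes(Sepal.Length, Petal.Length, colour = Species)) +
  geom_point()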

Internally, a factor is an integer vector with a character vector of levels stored as an attribute (see attributes(iris$Species) and class(attributes(iris$Species)$levels)), which is clean. If you had to change a level name (and you were using character strings), that would be a much less efficient operation, and I change level names a lot, especially for ggplot legends. If you fake factors with character vectors, there's also the risk that you'll change just one element and accidentally create a separate new level.
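
A small sketch of both points (the renamed label is just an illustration):

sp <- iris$Species
attributes(sp)        # a $levels character vector plus the "factor" class

# Renaming a level touches one string in the levels attribute,
# not all 150 underlying values:
levels(sp)[levels(sp) == "virginica"] <- "I. virginica"

# With a character vector, one typo silently becomes a new "level":
ch <- as.character(iris$Species)
ch[1] <- "setoza"
unique(ch)
# [1] "setoza"     "setosa"     "versicolor" "virginica"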

Vince

Ordered factors are awesome. If I happen to love oranges and hate apples but don't mind grapes, I don't need to manage some weird index to say so:

d <- data.frame(x = rnorm(20), f = sample(c("apples", "oranges", "grapes"), 20, replace = TRUE, prob = c(0.5, 0.25, 0.25)))
d$f <- ordered(d$f, c("apples", "grapes", "oranges"))
d[d$f >= "grapes", ]
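
The ordering is whatever you declare, not alphabetical. A minimal sketch with made-up grade labels, along the lines of the example in the comments below:

grade <- ordered(c("Z", "B", "A", "0", "3"),
                 levels = c("Z", "B", "A", "0", "1", "2", "3"))
grade > "A"   # "Z" is worst here, "3" is best
# [1] FALSE FALSE FALSE  TRUE  TRUE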
mdsumner
  • that's a neat application. Never thought of that. – JD Long Aug 10 '10 at 14:39
  • What did the `d$f <- ordered(d$f, c("apples", "grapes", "oranges"))` do? I would have guessed that it ordered these in the data frame, but after I run that line and print the data frame, nothing changes. Does it just impose an internal order even though the printed order doesn't change? – Addem Oct 30 '14 at 22:29
  • ... Yeah, I think what I wrote was something like a correct sentence. If I understand your point, you are showing us that you can assign an ordering on factors, which is something you cannot do for strings. – Addem Oct 30 '14 at 22:31
    ordered() creates an arbitrary ordering from any values - in the order you say they are ordered. It's unfortunate that I used lexicographically sorted values, that's a coincidence. For example I use this for data where "Z" is bad, "3" is good but the labels are not numeric *or* alphabetical - so I do ordered(data, c("Z", "B", "A", "0", "1", "2", "3")) and so then I can just do data > "A" and it's happy days. – mdsumner Oct 31 '14 at 11:42

A factor is most analogous to an enumerated type in other languages. Its appropriate use is for a variable which can only take on one of a prescribed set of values. In these cases, not every possible allowed value may be present in any particular set of data, and the "empty" levels accurately reflect that.

Consider some examples. For some data which was collected all across the United States, the state should be recorded as a factor. In this case, the fact that no cases were collected from a particular state is relevant. There could have been data from that state, but there happened (for whatever reason, which may be a reason of interest) to not be. If hometown was collected, it would not be a factor. There is not a pre-stated set of possible hometowns. If data were collected from three towns rather than nationally, the town would be a factor: there are three choices that were given at the outset and if no relevant cases/data were found in one of those three towns, that is relevant.

Other aspects of factors, such as providing a way to give an arbitrary sort order to a set of strings, are useful secondary characteristics of factors, but are not the reason for their existence.
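
A minimal sketch of the three-town case (the town names are made up):

# Three towns were chosen at the outset; no data happened to come from "Ceres".
town <- factor(c("Ashton", "Ashton", "Brooke"),
               levels = c("Ashton", "Brooke", "Ceres"))
table(town)
# town
# Ashton Brooke  Ceres 
#      2      1      0 

town_chr <- as.character(town)
table(town_chr)   # the character vector forgets "Ceres" was ever a possibility
# town_chr
# Ashton Brooke 
#      2      1 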

Brian Diggs

Factors are fantastic when one is doing statistical analysis and actually exploring the data. However, prior to that, when one is reading, cleaning, troubleshooting, merging and generally manipulating the data, factors are a total pain. More recently (as in the past few years) a lot of functions have improved to handle factors better. For instance, rbind plays nicely with them. I still find it a total nuisance to have leftover empty levels after subsetting.

#drop a whole bunch of unused levels from a whole bunch of columns that are factors using gdata
require(gdata)
drop.levels(dataframe)

I know that it is straightforward to recode the levels of a factor and to rejig the labels, and there are also wonderful ways to reorder the levels; my brain just cannot remember them, and I have to relearn them every time I use them. Recoding should just be a lot easier than it is.
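
For the record, one base-R way to recode and reorder in a single step, as a minimal sketch (forcats::fct_recode() and fct_relevel() are friendlier if you use the tidyverse):

f <- factor(c("lo", "hi", "med", "hi"))
# Named list: new label = old level; the list order becomes the level order.
levels(f) <- list(Low = "lo", Medium = "med", High = "hi")
f
# [1] Low    High   Medium High  
# Levels: Low Medium High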

R's string functions are quite easy and logical to use. So when manipulating I generally prefer characters over factors.

Farrel

What a snarky title!

I believe many estimation functions allow you to use factors to easily define dummy variables... but I don't use them for that.

I use them when I have very large character vectors with few unique observations. This can cut down on memory consumption, especially if the strings in the character vector are longer-ish.
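
If you want to check the savings on your own machine, a rough sketch (exact numbers depend on your R version and platform, and as the comments below note, R's string cache changes the picture):

x_chr <- sample(sprintf("a fairly long label number %d", 1:5), 1e6, replace = TRUE)
x_fct <- factor(x_chr)

print(object.size(x_chr), units = "Mb")   # roughly one pointer per element
print(object.size(x_fct), units = "Mb")   # roughly one integer per element, plus levels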

PS - I'm joking about the title. I saw your tweet. ;-)

Joshua Ulrich
    So you really just use them to conserve storage space. That makes sense. – JD Long Aug 10 '10 at 01:53
    Well at least it used to ;-). But a few R version ago character storage was rewritten to be internally hashed so part of this historic argument is now void. Still factors are *very* useful for grouping and modeling. – Dirk Eddelbuettel Aug 10 '10 at 01:56
    According to `?factor` it was R-2.6.0 and it says, "Integer values are stored in 4 bytes whereas each reference to a character string needs a pointer of 4 or 8 bytes." Would you save space converting to factor if the character string needed 8 bytes? – Joshua Ulrich Aug 10 '10 at 02:25
    N <- 1000;a <- sample(c("a","b", "c"), N, replace=TRUE); print(object.size(a), units="Kb"); print(object.size(factor(a)), units="Kb"); 8 Kb 4.5 Kb so it still seems to save some space. – Eduardo Leoni Aug 10 '10 at 02:36
    @Eduardo I got 4Kb vs 4.2Kb. For `N=100000` I got 391.5 Kb vs 391.8 Kb. So factor takes little more memory. – Marek Aug 10 '10 at 07:50
  • Update: there's some sort of string cache in R now that eliminates this consideration: http://stackoverflow.com/questions/18304760/why-character-is-often-preferred-to-factor-in-data-table-for-key – Frank Oct 16 '13 at 15:39

Factors are an excellent "unique-cases" badging engine. I've recreated this badly many times, and despite a couple of wrinkles occasionally, they are extremely powerful.

library(dplyr)
d <- tibble(x = sample(letters[1:10], 20, replace = TRUE))

## normalize this table into an indexed value across two tables
id <- tibble(x_u = sort(unique(d$x))) %>% mutate(x_i = row_number())
di <- tibble(x_i = as.integer(factor(d$x)))


## reconstruct d$x when needed
d2 <- inner_join(di, id) %>% transmute(x = x_u)
identical(d, d2)
## [1] TRUE

If there's a better way to do this task I'd love to see it; I don't see this capability of factor discussed.

mdsumner

Only with factors can we handle NAs by setting them as a factor level. This is handy because many functions leave out NA values. Let's generate some toy data:

df <- data.frame(x= rnorm(10), g= c(sample(1:2, 9, replace= TRUE), NA))

If we want means of x grouped by g we can use

aggregate(x ~ g, df, mean)
  g          x
1 1  1.0415156
2 2 -0.3071171

As you can see, we do not get the mean of x for the cases where g is NA. The same problem occurs if we use by instead (see by(df$x, list(df$g), mean)). There are many other similar examples where functions (by default or in general) do not consider NAs.

But we can add NA as a factor level. See here:

aggregate(x ~ addNA(g), df, mean)
  addNA(g)          x
1        1 -0.2907772
2        2 -0.2647040
3     <NA>  1.1647002

Yeah, we see the mean of x where g is NA. One could argue that the same output is possible with paste0, which is true (try aggregate(x ~ paste0(g), df, mean)). But only with addNA can we backtransform the NAs to actual missings. So let's first transform g with addNA and then backtransform it:

df$g_addNA <- addNA(df$g)
df$g_back <- factor(as.character(df$g_addNA))
df$g_back
 [1] 2    2    1    1    1    2    2    1    1    <NA>
Levels: 1 2

Now the NAs in g_back are actual missings. See any(is.na(df$g_back)), which returns TRUE.

This even works in strange situations where "NA" was a value in the original vector! For example, the vector vec <- c("a", "NA", NA) can be transformed using vec_addNA <- addNA(vec) and we can actually backtransform this with

as.character(vec_addNA)
[1] "a"  "NA" NA

On the other hand, to my knowledge we cannot backtransform vec_paste0 <- paste0(vec), because in vec_paste0 the "NA" and the NA are the same! See

vec_paste0
[1] "a"  "NA" "NA"

I started this answer with "Only with factors can we handle NAs by setting them as a factor level." In fact, I would be careful using addNA, but regardless of the risks associated with it, the fact stands that there is no similar option for characters.


tapply (and aggregate) rely on factors. The information-to-effort ratio of these functions is very high.

For instance, in a single line of code (the call to tapply below) you can get mean price of diamonds by Cut and Color:

> data(diamonds, package="ggplot2")

> head(diamonds[, c("carat", "cut", "clarity", "price", "color")])

  carat     cut clarity price color
1  0.23   Ideal     SI2   326     E
2  0.21 Premium     SI1   326     E
3  0.23    Good     VS1   327     E


> tx = with(diamonds, tapply(X=price, INDEX=list(Cut=cut, Color=color), FUN=mean))

> a = sort(1:dim(tx)[2], decreasing=TRUE)  # reverse columns for readability

> tx[,a]

         Color
Cut         J    I    H    G    F    E    D
Fair      4976 4685 5136 4239 3827 3682 4291
Good      4574 5079 4276 4123 3496 3424 3405
Very Good 5104 5256 4535 3873 3779 3215 3470
Premium   6295 5946 5217 4501 4325 3539 3631
Ideal     4918 4452 3889 3721 3375 2598 2629
doug