Factors have a dual representation: the 'label' and the underlying integer encoding of the level. Which of these representations R uses in a given operation can be subtle and confusing.
One place where this causes confusion is subsetting. Here's a named vector, a character vector, and a factor with default (alphabetically ordered) levels:
x = c(foo = 1, bar = 2)
y = c("bar", "foo")
z = factor(y) # default levels are "bar", "foo", i.e., alphabetical
Subsetting x by y matches character value to name, but subsetting x by z uses the underlying level encoding.
> x[y]
bar foo
  2   1
> x[z]
foo bar
  1   2
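One way to see which representation is in play is to convert the factor explicitly before subsetting. The following is a minimal sketch in base R; it redefines the same x, y, and z so that it runs on its own.

```r
x = c(foo = 1, bar = 2)
y = c("bar", "foo")
z = factor(y)

as.integer(z)     # the underlying integer codes of the levels
as.character(z)   # the labels

# Indexing with a factor behaves like indexing with its integer codes
identical(x[z], x[as.integer(z)])

# Converting to character first restores matching by name
identical(x[as.character(z)], x[y])
```

Both comparisons should return TRUE, so wrapping the factor in as.character() is a simple way to ask for name matching when that is what is intended.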
This can be made even more confusing because R can work in different locales (e.g., I am using the en_US locale, US English), and the collation (sort) order can differ between locales, so the default levels, and therefore the underlying codes, might differ from one locale to another.
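If the result should not depend on the locale, one option is to supply the levels explicitly instead of relying on the default sort order. This is a sketch in base R; using names(x) as the levels is a choice made here for illustration, not something the example above requires.

```r
x = c(foo = 1, bar = 2)
y = c("bar", "foo")

# Explicit levels: the codes no longer depend on the collation order
z = factor(y, levels = names(x))
levels(z)       # "foo" "bar"
as.integer(z)   # 2 1

# With levels in the same order as names(x), the integer codes are
# exactly the positions that name matching would pick
identical(x[z], x[y])
```

Because the levels are in the same order as the names of x, subsetting by the factor and subsetting by the character vector select the same elements here.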