
This may seem like a silly question, but after working with R for a couple of months, I realised I often find myself converting strings to factors as, for example, the tabulate function does not work on strings.
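As a minimal sketch of what I mean (the vector `s` is made up purely for illustration):

```r
s <- c("b", "a", "b", "c")   # hypothetical example data

# tabulate() rejects character input; capture the error message instead of stopping
res <- tryCatch(tabulate(s), error = function(e) conditionMessage(e))

# After converting to a factor it works: counts for levels "a", "b", "c"
counts_factor <- tabulate(factor(s))   # 1 2 1

# table() accepts the strings directly
counts_table <- table(s)
```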

At this point I am contemplating simply always converting any string to a factor. But that raises the question: is there any reason not to (apart from carrying out operations on the string itself)?

Tom
  • Read this [blog post](https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) by Roger Peng on the history of `stringsAsFactors` – phiver Sep 16 '18 at 07:09
  • I would say that, basically, if you're not careful, factors can lead to unexpected behavior. As for `tabulate`, is there any reason not to use `table`? – MichaelChirico Sep 16 '18 at 07:14
  • No, I was really just giving an example. But essentially the answer (according to the blog) is that there is not really a good reason unless you're not a statistician? – Tom Sep 16 '18 at 07:18
  • @MichaelChirico Could you give an example of unexpected behaviour? – Tom Sep 16 '18 at 07:23

1 Answer


Factors have a dual representation: the 'label' of each level, and the underlying integer encoding of that level. Which of these representations R uses in a given context can be subtle and confusing.
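A small sketch of the two representations, using a made-up factor `f` (not part of the example further down). It also shows a common pitfall: converting a factor of numeric-looking strings straight to numbers returns the codes, not the labels.

```r
f <- factor(c("10", "20", "10"))   # hypothetical example data

levels(f)                     # the labels:          "10" "20"
as.integer(f)                 # the underlying codes: 1 2 1
as.character(f)               # back to the labels:  "10" "20" "10"

as.numeric(f)                 # codes again: 1 2 1 (probably not what was intended)
as.numeric(as.character(f))   # the numbers the labels spell out: 10 20 10
```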

One illustration of where this can be confusing is subsetting. Here are a named vector, a character vector, and a factor with default (alphabetically ordered) levels:

x = c(foo = 1, bar = 2)
y = c("bar", "foo")
z = factor(y)        # default levels are "bar", "foo", i.e., alphabetical

Subsetting x by y matches each character value to a name, but subsetting x by z uses the underlying integer codes of the levels (here 1 and 2) as positions.

> x[y]
bar foo 
  2   1 
> x[z]
foo bar 
  1   2 
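If subsetting by label is what is wanted, one way to get it (a sketch, not the only option) is to convert the factor back to character before indexing. The objects are redefined here so the snippet runs on its own:

```r
x <- c(foo = 1, bar = 2)
y <- c("bar", "foo")
z <- factor(y)

by_code  <- x[z]                 # indexes by the integer codes 1, 2: foo bar
by_label <- x[as.character(z)]   # indexes by name:                   bar foo

identical(by_label, x[y])        # TRUE
```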

This can be even more confusing because R can run in different locales (I am using en_US, i.e., US English), and the collation (sort) order can differ between locales, so the default levels of the same factor might differ from one locale to another.
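One way to sidestep the locale dependence, sketched here with a made-up mixed-case vector `v`, is to pass the levels explicitly rather than relying on the default sort order:

```r
v <- c("b", "A", "a", "B")   # hypothetical example data

# Default levels come from sorting the unique values, so their order
# depends on the collation rules of the current locale.
f_default <- factor(v)

# Explicit levels are used exactly as given, whatever the locale.
f_fixed <- factor(v, levels = c("A", "B", "a", "b"))

levels(f_fixed)       # "A" "B" "a" "b"
as.integer(f_fixed)   # 4 1 3 2
```

The order `c("A", "B", "a", "b")` is an arbitrary choice for the example; the point is only that it is stated in the code rather than inherited from the environment.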

Martin Morgan