Is there any good reason for columns to be characters instead of factors?

Question

This may seem like a silly question, but after working with R for a couple of months, I realised I often find myself converting strings to factors as, for example, the tabulate function does not work on strings.

At this point I am contemplating simply always converting any string to a factor. But that raises the question: is there any reason not to (apart from carrying out operations on the string itself)?

Read this [blog post](https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) by Roger Peng on the history of `stringsAsFactors` — phiver, Sep 16 '18 at 07:09
i would say basically, if you're not careful, factors can lead to unexpected behavior. as to `tabulate`, any reason not to use `table`? — MichaelChirico, Sep 16 '18 at 07:14
No, I was really just giving an example. But essentially the answer (according to the blog) is that there is not really a good reason unless you're not a statistician? — Tom, Sep 16 '18 at 07:18
@MichaelChirico Could you give an example of unexpected behaviour? — Tom, Sep 16 '18 at 07:23

score 6 · Accepted Answer · answered Sep 16 '18 at 07:49

Factors have a dual representation: the 'label' and the underlying integer encoding of the level. Which of these representations R uses in a given context can be subtle and confusing.

One place where this causes confusion is subsetting. Here's a named vector, a character vector, and a factor with default (alphabetically ordered) levels:

x = c(foo = 1, bar = 2)
y = c("bar", "foo")
z = factor(y)        # default levels are "bar", "foo", i.e., alphabetical

Subsetting x by y matches character value to name, but subsetting x by z uses the underlying level encoding (the integer codes 1, 2), so elements are picked by position rather than by name.

> x[y]
bar foo 
  2   1 
> x[z]
foo bar 
  1   2
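
As a minimal sketch of how to sidestep this (reusing the x and z from above; this is one possible workaround, not the only one), you can look at the integer codes directly and convert the factor back to character before subsetting:

```r
x <- c(foo = 1, bar = 2)
z <- factor(c("bar", "foo"))   # default levels: "bar", "foo"

codes <- as.integer(z)         # the integer codes stored behind the labels
by_code  <- x[z]               # same as x[codes]: positional subsetting
by_label <- x[as.character(z)] # matches labels to names, like x[y]

names(by_code)                 # "foo" "bar"
names(by_label)                # "bar" "foo"
```

Wrapping the factor in as.character() makes the intent explicit, so the result no longer depends on how the levels happen to be encoded.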

This can be made even more confusing because R can work in different locales (e.g., I am using the en_US locale, US English) and the collation (sort) order can differ between locales, so the default levels might differ from one locale to another.
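
A small sketch of one way to guard against this, assuming you know the set of values in advance: pass the levels explicitly, so the encoding does not depend on the collation order of the current locale. The values here (including the mixed-case "Baz") are made up for illustration.

```r
y <- c("bar", "foo", "Baz")    # illustrative values, not from the question

# Default: levels come from sorting the unique values, which follows the
# collation order of the current locale (upper/lower case may sort differently).
z_default <- factor(y)

# Explicit: the level order, and therefore the integer codes, are fixed.
z_explicit <- factor(y, levels = c("bar", "foo", "Baz"))

levels(z_explicit)             # "bar" "foo" "Baz"
as.integer(z_explicit)         # 1 2 3
```

With explicit levels, the same script gives the same codes on any machine, whereas levels(z_default) may come out in a different order under a different locale.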
