Are there any assumptions made when converting a column to a factor in R? I ask because I have character columns that, if converted to factors, would have too many levels for functions such as randomForest. Is there any disadvantage to keeping them as characters?

CJava

1 Answer

I usually keep my variables as character rather than factor for most of a project (e.g. reading, cleaning, manipulating), and typically only convert them to factors just before analysis. The main reason I know of for explicitly storing a variable as a factor is to control the base level in an analysis, such as choosing the left-out category in a linear model with dummy variables.
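As a minimal sketch of that workflow (the data frame `df` and its columns `y` and `grp` are made up for illustration), the column stays character during cleaning and is only converted, and its base level chosen with `relevel()`, right before fitting:

```r
# Hypothetical toy data; the names df, y and grp are illustrative only.
df <- data.frame(
  y   = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  grp = c("a", "b", "c", "a", "b", "c"),
  stringsAsFactors = FALSE
)

# grp is still character here, which is convenient for cleaning.
is.character(df$grp)

# Convert just before modelling; levels default to sorted order,
# so "a" would be the base (left-out) category.
df$grp <- factor(df$grp)

# Explicitly make "c" the base level instead.
df$grp <- relevel(df$grp, ref = "c")

# The dummies in the fit are now for "a" and "b"; "c" is left out.
fit <- lm(y ~ grp, data = df)
names(coef(fit))
```

With `"c"` as the reference level, the coefficients for `grpa` and `grpb` are differences from the `"c"` group mean.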

It used to be the case (a number of years ago) that the biggest advantage of keeping variables as factors was saving memory. A factor is more or less stored as an integer vector, which took up a lot less space than a character vector, especially when there were repeated elements. As @MichaelChirico pointed out to me below, this has not been the case for quite a while (since somewhere around version 2.8).
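If you want to check the memory footprint on your own R build rather than take this on faith, a rough sketch is to compare the two representations with `object.size()`. The vector contents below are arbitrary, and the exact numbers will depend on your R version and platform:

```r
# Arbitrary example data: a long vector with only three distinct strings.
set.seed(1)
x_chr <- sample(c("alpha", "beta", "gamma"), 1e5, replace = TRUE)
x_fct <- factor(x_chr)

# A factor is an integer vector with a "levels" attribute.
typeof(x_fct)
levels(x_fct)

# Compare the reported sizes of the two representations.
size_chr <- object.size(x_chr)
size_fct <- object.size(x_fct)
print(size_chr, units = "Kb")
print(size_fct, units = "Kb")
```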

lmo