Are there any assumptions made when choosing to convert a column to a factor in R? I ask because I have character columns that, if converted to factors, would have too many levels for things such as randomForest. Is there a disadvantage to keeping them as characters?
1 Answer
I usually like to keep my variables as character rather than factor for most of a project (e.g., reading, cleaning, manipulating). I typically only convert them to factors just before analysis. As it stands, the main reason I know of for storing a variable as a factor is to explicitly control the base level in an analysis, such as choosing the left-out category in a linear model with dummy variables.
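As a minimal sketch of that workflow (the toy data frame and its column names `y` and `group` are made up for illustration), you could convert the column right before modelling and pick the reference category with `relevel()`:

```r
# Hypothetical toy data; the column names are illustrative only.
df <- data.frame(
  y     = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  group = c("b", "a", "c", "a", "b", "c"),
  stringsAsFactors = FALSE
)

# `group` stays character during cleaning; convert just before the analysis.
df$group <- factor(df$group)
levels(df$group)   # levels are sorted, so "a" would be the default base level

# Make "c" the reference (left-out) category instead.
df$group <- relevel(df$group, ref = "c")

fit <- lm(y ~ group, data = df)
names(coef(fit))   # dummies for "a" and "b"; "c" is absorbed into the intercept
```

Passing an explicit `levels =` argument to `factor()` would work as well; `relevel()` is just the shortest way to move a single level to the front.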
It used to be the case (a number of years ago) that the biggest advantage of keeping variables as factors was saving memory. A factor was stored, more or less, as an integer vector, which took up a lot less space than a character vector, especially when there were repeated elements. As @MichaelChirico pointed out to me below, this has not been the case for quite a while (since somewhere around version 2.8).
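If you want to check this on your own R version, here is a rough sketch (the vector contents and length are arbitrary assumptions) that shows the integer codes behind a factor and lets you compare the reported sizes yourself. Note that `object.size()` is only an estimate, so treat the numbers as indicative:

```r
set.seed(1)

# Arbitrary example: a long vector with only three distinct values.
x_chr <- sample(c("alpha", "beta", "gamma"), 1e5, replace = TRUE)
x_fct <- factor(x_chr)

typeof(x_fct)          # a factor is stored as integer codes ...
levels(x_fct)          # ... plus a character vector of levels
head(unclass(x_fct))   # the underlying codes

# Compare the estimated memory footprint of the two representations.
object.size(x_chr)
object.size(x_fct)
```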

Not really true that they save memory, see: http://stackoverflow.com/a/13570765/3576984 – MichaelChirico Apr 08 '16 at 19:35
Historical context never hurts. – lmo Apr 08 '16 at 20:02
Yes, I too agree about historical context. Very neat explanation. – Manoj Kumar Apr 08 '16 at 21:07