why "character is often preferred to factor" in data.table for key?

Question

From data.table manual:

In fact we like it so much that data.table contains a counting sort algorithm for character vectors using R’s internal global string cache. This is particularly fast for character vectors containing many duplicates, such as grouped data in a key column. This means that character is often preferred to factor. Factors are still fully supported, in particular ordered factors (where the levels are not in alphabetic order).

Isn't factor just integer which should be easier to do counting sort than character?

I think this can help From data.table FAQ 2.17 `Since a global string cache was added to R, characters items are a pointer to the single cached string and there is no longer a performance benet of coverting to factor`. — agstudy, Aug 19 '13 at 00:22

score 14 · Accepted Answer · answered Aug 19 '13 at 21:21

Isn't factor just integer which should be easier to do counting sort than character?

Yes, if you're given a factor already. But the time to create that factor can be significant and that's what setkey (and ad hoc by) aim to beat. Try timing factor() on a randomly ordered character vector, say 1e6 long with 1e4 levels. Then compare to setkey or ad hoc by on the original randomly ordered character vector.

agstudy's comment is correct too; i.e., character vectors (being pointers to R cached strings) are quite similar to factors anyway. On 32bit systems character vectors are the same size as the factor's integer vector but the factor has the levels attribute to store (and sometimes copy) too. On 64bit systems the pointers are twice as big. But on the other hand R's string cache can be looked up directly from character vector pointers, whereas the factor has an extra hop via levels. (The levels attribute is a character vector of R string cache pointers too.)

This answer is somewhat dated -- what's changed since this is that `data.table` does a lot of things in parallel, _but this can't be done on character columns_ due to a restriction from R itself (in fact this restriction is related to the global cacheing of strings). So the factor-character choice is not so simple anymore -- if you have a low-cardinality (`uniqueN(x)/length(x)` is small) string column on which you are computing (grouping, subsetting, calculating) a lot, it may be worth switching to `factor` to take advantage of the parallelism. — MichaelChirico, Oct 11 '20 at 22:21

why "character is often preferred to factor" in data.table for key?

1 Answers1

Linked