
I have a serious performance issue with a rather simple operation: in fact, I get no result at all, even after several hours of running the code.

My data frame consists of approximately 400k rows and 10 variables. The code for the operation is:

library(dplyr)

a2 <- dat %>%
  group_by(X1, X2, X3, X4) %>%
  summarise(a = length(unique(ID)))

Here X1-X4 are all factors (1,600 to 5,600 levels each). Could the issue be that my ID variable is also a factor (184,573 levels)? If so, how can I fix this? I used similar code on a data frame where ID was an int, and that worked fine.

However, with my current dataset, changing ID to int is not possible, and changing it to chr does not seem to make sense. Does anyone have an answer?
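
To illustrate what I suspect is going on (the sizes below are illustrative, not taken from my actual data): subsetting a factor keeps its full levels attribute, so every per-group slice of ID drags all 184,573 levels along with it.

f <- factor(sample(1:5, 10, replace = TRUE), levels = 1:184573)
length(levels(f[1:2]))  # 184573: the subset keeps every level
object.size(f[1:2])     # much larger than the two values it actually holds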

  • Try to solve it with `data.table` and don't use factors. – Andre Elrico Jun 14 '18 at 07:45
  • Could you provide example data, `dput(head(dat))`, or just `head(dat)` if there are too many factors? Yes, I would keep IDs as int or char, not factor. – zx8754 Jun 14 '18 at 07:47
  • Also, try using `a = n_distinct(ID)` instead of `a = length(unique(ID))`. – zx8754 Jun 14 '18 at 07:50
  • Relevant post, to see other package options: https://stackoverflow.com/questions/12840294/counting-unique-distinct-values-by-group-in-a-data-frame – zx8754 Jun 14 '18 at 07:53
  • On my aging PC it took 1.211559 seconds (median), timed with `microbenchmark::microbenchmark`, on a data frame of the size in the question, using `a = n_distinct(ID)`. – Rui Barradas Jun 14 '18 at 09:33
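
A minimal sketch of the `data.table` route Andre Elrico suggests, combined with the other comments' advice to keep ID as char rather than factor. It assumes `dat` has the columns from the question; `uniqueN()` is data.table's fast counter of distinct values:

library(data.table)

setDT(dat)                      # convert the data.frame by reference, no copy
dat[, ID := as.character(ID)]   # drop the 184k-level factor, per the comments
a2 <- dat[, .(a = uniqueN(ID)), by = .(X1, X2, X3, X4)]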
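
And a sketch of the dplyr variant with `n_distinct(ID)`, timed the way Rui Barradas describes; the `times` value here is an arbitrary choice, not from his comment:

library(dplyr)
library(microbenchmark)

microbenchmark(
  dat %>%
    group_by(X1, X2, X3, X4) %>%
    summarise(a = n_distinct(ID)),
  times = 10
)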

0 Answers