
I have a serious performance issue with a rather simple operation: in fact, I get no result at all, even after several hours of running the code.

My data frame consists of approximately 400k rows and 10 variables. The code for the operation is:

library(dplyr)

a2 <- dat %>%
  group_by(X1, X2, X3, X4) %>%
  summarise(a = length(unique(ID)))

Here X1-X4 are all factors (1,600 to 5,600 levels each). Could the issue be that my ID variable is also a factor (184,573 levels)? If so, how can I fix this? I used similar code on a data frame where ID was an int, and that worked fine.

However, with my current dataset, changing ID to int is not possible, and changing it to chr does not seem to make sense. Does anyone have an answer?
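
To illustrate what I suspect is going on (the sizes below are illustrative, not taken from my actual data): subsetting a factor keeps its full levels attribute, so every per-group slice of ID drags all 184,573 levels along with it.

f <- factor(sample(1:5, 10, replace = TRUE), levels = 1:184573)
length(levels(f[1:2]))  # 184573: the subset keeps every level
object.size(f[1:2])     # much larger than the two values it actually holds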

  • Try to solve it with `data.table` and don't use factors. – Andre Elrico Jun 14 '18 at 07:45
  • Could you provide example data, `dput(head(dat))`, or just `head(dat)` if there are too many factors? Yes, I would keep IDs as int or char, not factor. – zx8754 Jun 14 '18 at 07:47
  • Also, try using `a = n_distinct(ID)` instead of `a = length(unique(ID))`. – zx8754 Jun 14 '18 at 07:50
  • Relevant post, to see other package options: https://stackoverflow.com/questions/12840294/counting-unique-distinct-values-by-group-in-a-data-frame – zx8754 Jun 14 '18 at 07:53
  • On my aging PC it took 1.211559 seconds (median), timed with `microbenchmark::microbenchmark`, on a data frame of the size in the question, using `a = n_distinct(ID)`. – Rui Barradas Jun 14 '18 at 09:33
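
A minimal sketch of the `data.table` route Andre Elrico suggests, combined with the other comments' advice to keep ID as char rather than factor. It assumes `dat` has the columns from the question; `uniqueN()` is data.table's fast counter of distinct values:

library(data.table)

setDT(dat)                      # convert the data.frame by reference, no copy
dat[, ID := as.character(ID)]   # drop the 184k-level factor, per the comments
a2 <- dat[, .(a = uniqueN(ID)), by = .(X1, X2, X3, X4)]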
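
And a sketch of the dplyr variant with `n_distinct(ID)`, timed the way Rui Barradas describes; the `times` value here is an arbitrary choice, not from his comment:

library(dplyr)
library(microbenchmark)

microbenchmark(
  dat %>%
    group_by(X1, X2, X3, X4) %>%
    summarise(a = n_distinct(ID)),
  times = 10
)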

0 Answers