faster complete() within grouped df, using tidyverse

Question

Yesterday I posted about how to expand/complete within group values.

Testing the solution on a minimal example df worked great, but on my real data, it's not computing after several hours.

My dplyr pipe flow looks like this:

mydf <- mydf |>
group_by_at(vars(id:trial_day)) |> 
  summarise_at(vars(bla:last_col()), sum) |> 
  complete(trial_day = 1:14)

I tried swapping out complete() for expand() but doing so results in the grouped vars only being kept, the other vars are dropped e.g.

df <- data.frame(
  id = rep('a', 5),
  x = 6:10,
  y = 5:1
)

# returns all cols
df |> 
  group_by(id) |> 
  complete(x = 1:10)

# only returns id and x, no y
df |> 
  group_by(id) |> 
  expand(x = 1:10)

But I'm not sure if expand would even be any faster.

I tried doing a right join onto a df ladder = data.frame(day = 1:14) but that resulted in the expanded rows having NA for the grouping vars, I wanted those to fill whenever an expansion took place.

Is there a faster way to get the same result as I do using complete()?

Related non-tidyverse: https://stackoverflow.com/questions/46712379/efficient-way-to-fill-time-series-per-group — Henrik, Apr 22 '23 at 09:13

score 2 · Accepted Answer · answered Apr 22 '23 at 08:52

2

If you are having a fixed value to expand across groups like in the example you have shared, you don't need to add group_by. We want to complete fixed values (1:10) irrespective of the group it is in.

If you remove group_by from your code the code is faster 3 times when compared it with grouped data. I have created a larger example to demonstrate this.

# New example data
df <- data.frame(
  id = sample(c(letters, LETTERS), 1e6, replace = TRUE),
  x = sample(9, 1e6, replace = TRUE),
  y = sample(5, 1e6, replace = TRUE)
)

# compare the results with and without group_by and make sure they are the same

library(tidyr)
library(dplyr)

res1 <- df |> group_by(id) |> complete(x = 1:10) |> ungroup() |> arrange(id)
res2 <- df |> complete(id, x = 1:10) |> arrange(id)

identical(res1, res2)
#[1] TRUE

# Check the performance
microbenchmark::microbenchmark(
  with_group = df |> group_by(id) |> complete(x = 1:10), 
  without_group = df |> complete(id, x = 1:10)
)

# Unit: milliseconds
#          expr       min        lq     mean   median       uq      max neval cld
#    with_group 257.97722 290.68443 354.1348 311.0921 344.7817 673.3979   100  a 
# without_group  76.98431  98.45134 144.6897 107.9547 116.0458 539.7276   100   b

answered Apr 22 '23 at 08:52

Ronak Shah

377,200
20
156
213

`complete(id, x = 1:10)` you mean put the grouping var id within complete() itself. OK, but if I actually have multiple grouping vars? Tried e.g. `complete(id:bla, x = 1:10)` but that didn't work – Doug Fir Apr 22 '23 at 08:57
1

"That didn't work" isn't very helpful. Your expected output is not clear to me. Have you looked at `nesting()`? Also, `group_by_at`, `summarise_at` and the like have been superseded. – Limey Apr 22 '23 at 09:09
Looking at docs, there's no explicit way I can see to pass multiple grouping vars https://tidyr.tidyverse.org/reference/complete.html – Doug Fir Apr 22 '23 at 09:15
1

@DougFir I couldn't find a way to use `:` or other tidyselect helper functions in `complete`. So what you need to do is `complete(id, col1, col2, bla, x = 1:10)` – Ronak Shah Apr 22 '23 at 09:15
1

@DougFir I found this method from here https://github.com/tidyverse/tidyr/issues/1397#issuecomment-1260747538 `df |> complete(!!!rlang::syms(df %>% select(id:bla) %>% names), x = 1:10)` – Ronak Shah Apr 22 '23 at 09:25

score -1 · Answer 2 · answered Apr 22 '23 at 10:17

We could use bind_rows() with swapping x and y:

library(dplyr)
bind_rows(df, df %>% 
            select(id, x=y, y=x))

output:

  id  x  y
1   a  6  5
2   a  7  4
3   a  8  3
4   a  9  2
5   a 10  1
6   a  5  6
7   a  4  7
8   a  3  8
9   a  2  9
10  a  1 10

data:

df <- data.frame(
  id = rep('a', 5),
  x = 6:10,
  y = 5:1
)

faster complete() within grouped df, using tidyverse

2 Answers2