I've got a dplyr query which runs on a large data frame and it's painfully slow. Reprex:

Start with a data frame df which has duplicate rows (because it was formed by a left_join() call). If I see a duplicated index value then the name, year and city values will be duplicated too.
df <- data.frame(index = c(1, 1, 1, 2, 2, 3),
                 name = c("Andy", "Andy", "Andy", "Bob", "Bob", "Charles"),
                 year = c(1970, 1970, 1970, 1971, 1971, 1972),
                 city = c("Amsterdam", "Amsterdam", "Amsterdam", "Barcelona", "Barcelona", "Copenhagen"),
                 amount = c(123, 234, 345, 456, 567, 678))
I want to know the sum of the amount field for each value of index. However, I want to retain name, year and city.
output_i_want <- data.frame(index = c(1, 2, 3),
                            name = c("Andy", "Bob", "Charles"),
                            year = c(1970, 1971, 1972),
                            city = c("Amsterdam", "Barcelona", "Copenhagen"),
                            total_amount = c(702, 1023, 678))
It's easy enough to do it like this:
df |>
  group_by(index) |>
  summarise(name = first(name),
            year = first(year),
            city = first(city),
            total_amount = sum(amount)) |>
  ungroup()
...but in my real-world case (where first() appears about 20 times and sum() appears 8 times) it's horribly slow.
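For reference, across() is (as far as I know) just more compact notation for the same repeated first() calls, so I assume it does the same underlying work. Here it is on the reprex columns:

df |>
  group_by(index) |>
  summarise(across(c(name, year, city), first),  # one first() per metadata column
            total_amount = sum(amount)) |>
  ungroup()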
If I instead do:
df |>
  group_by(index) |>
  summarise(total_amount = sum(amount)) |>
  ungroup()
then it runs fast, but I lose name, year and city, and I'm not sure how best to get them back. Do I need a different type of join afterwards, or some other technique?
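For example, would a sketch like this (untested on the real data) be the right idea? It collapses the duplicated metadata with distinct() and then joins the fast totals back on:

library(dplyr)

# One row of metadata per index; safe because duplicated index values
# always carry identical name/year/city.
meta <- df |> distinct(index, name, year, city)

# The fast aggregation.
totals <- df |>
  group_by(index) |>
  summarise(total_amount = sum(amount)) |>
  ungroup()

# Re-attach the metadata; on the reprex this matches output_i_want.
output <- meta |> left_join(totals, by = "index")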
Thank you.