0

I've got a dplyr query which runs on a large data frame and it's painfully slow. Reprex:

Start with a dataframe df which has duplicate rows (because it was formed by a left_join() call). If I see a duplicate index value then the name, year and city values will be duplicated too.

df <- data.frame(index  = c(1, 1, 1, 2, 2, 3),
                 name   = c("Andy", "Andy", "Andy", "Bob", "Bob", "Charles"),
                 year   = c(1970, 1970, 1970, 1971, 1971, 1972),
                 city   = c("Amsterdam", "Amsterdam", "Amsterdam", "Barcelona", "Barcelona", "Copenhagen"),
                 amount = c(123, 234, 345, 456, 567, 678))

I want to know the sum of the amount field for each value of index. However I want to retain name, year and city.

output_i_want <- data.frame(index        = c(1, 2, 3),
                            name         = c("Andy", "Bob", "Charles"),
                            year         = c(1970, 1971, 1972),
                            city         = c("Amsterdam", "Barcelona", "Copenhagen"),
                            total_amount = c(702, 1023, 678))

It's easy enough to do it like this:

df |>
  group_by(index) |>
  summarise(name         = first(name),
            year         = first(year),
            city         = first(city),
            total_amount = sum(amount)) |>
  ungroup()

...but in my real world case (where first() appears about 20 times and sum() appears 8 times) it's horribly slow.

If I instead do:

df |>
  group_by(index) |>
  summarise(total_amount = sum(amount)) |>
  ungroup()

then it runs fast, but I then lose name, year and city - and I'm not sure how best to get them back. Do I need a different type of join afterwards, or some other technique?

Thank you.

Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214
Alan
  • 619
  • 6
  • 19

1 Answers1

0

1) Since the columns involved with first in the question are constant within index try grouping by all 4 to eliminate the need for first. The code below assumes all non-grouping columns are to be summed but you could specify something like where(is.numeric) if all numeric non-grouping columns are to be summed or amount1:amount3 if the columns are side by side or starts_with("amount") if they all start with amount.

library(dplyr) # version 1.1.0 or later

df %>%
  summarize(across(everything(), sum, .names = "total_{.col}"), .by = index:city)

giving:

  index    name year       city total_amount
1     1    Andy 1970  Amsterdam          702
2     2     Bob 1971  Barcelona         1023
3     3 Charles 1972 Copenhagen          678

2) A base solution using the same idea is the following. Change the numbers as needed. Omit the last 2 statements if having a total_ prefix is not important.

ag <- aggregate(df[5], df[-5], sum)
names(ag)[5] <- paste("total", names(ag)[5], sep = "_")
ag

giving:

  index    name year       city total_amount
1     1    Andy 1970  Amsterdam          702
2     2     Bob 1971  Barcelona         1023
3     3 Charles 1972 Copenhagen          678
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341