1

I have a huge dataset with 750,000 IDs, for which I want to aggregate monthly values to yearly values by multiplying all values for a given ID. The ID consists of a combination of an identification number and a year.

The data I want to extract:

ID monthly value
1 - 1997 Product of Monthly Values in Year 1997
1 - 1998 Product of Monthly Values in Year 1998
1 - 1999 Product of Monthly Values in Year 1999
... ...
2 - 1997 Product of Monthly Values in Year 1997
2 - 1998 Product of Monthly Values in Year 1998
2 - 1999 Product of Monthly Values in Year 1999
... ...

The dataset which is the source:

ID monthly value
1 - 1997 Monthly Value 1 in Year 1997
1 - 1997 Monthly Value 2 in Year 1997
1 - 1997 Monthly Value 3 in Year 1997
... ...
2 - 1997 Monthly Value 1 in Year 1997
2 - 1997 Monthly Value 2 in Year 1997
2 - 1997 Monthly Value 3 in Year 1997
... ...

I have written a for loop, which takes about 0.74s for 10 IDs, which is way to slow. It would take about 15 hours for the whole data to run through. The for loop multiplies all monthly values for a given ID and stores it in a separate data frame.

for (i in 1:nrow(yearlyreturns)){
  
  yearlyreturns[i, "yret"] <- prod(monthlyreturns[monthlyreturns$ID == yearlyreturns[i,"ID"],"change"]) - 1
  yearlyreturns[i, "monthcount"] <- length(monthlyreturns[monthlyreturns$ID == yearlyreturns[i,"ID"],"change"])
  
}

I don't know how to get from here to a vectorised function, which takes less time.

Is this possible to do in R?

3 Answers3

1

Something like this:

library(dplyr)

df %>% 
  mutate(monthly_value = paste("Product of", str_replace(monthly_value, 'Value\\s\\d', 'Values'))) %>% 
  group_by(ID, monthly_value) %>% 
  summarise()
  ID       monthly_value                         
  <chr>    <chr>                                 
1 1 - 1997 Product of Monthly Values in Year 1997
2 2 - 1997 Product of Monthly Values in Year 1997

data:

structure(list(ID = c("1 - 1997", "1 - 1997", "1 - 1997", "2 - 1997", 
"2 - 1997", "2 - 1997"), monthly_value = c("Monthly Value 1 in Year 1997", 
"Monthly Value 2 in Year 1997", "Monthly Value 3 in Year 1997", 
"Monthly Value 1 in Year 1997", "Monthly Value 2 in Year 1997", 
"Monthly Value 3 in Year 1997")), class = "data.frame", row.names = c(NA, 
-6L))
TarJae
  • 72,363
  • 6
  • 19
  • 66
1

Based on the for loop code, this may be a done with a join

library(data.table)
setDT(yearlyreturns)[monthlyreturns, c("yret", "monthcount") 
     := .(prod(change) -1, .N), on = .(ID), by = .EACHI]
akrun
  • 874,273
  • 37
  • 540
  • 662
0

In addition to the most excellent previous answers - here's a link to an earlier post comparing 10 common ways to calculate means by group. Data.table based solutions are definitely the way to go - especially for datasets with millions of rows. Unless you're writing to individual output files - I'm not sure why this would take hours rather than minutes.

Norbs
  • 1
  • thanks for your feedback. As soon as you'll be able to comment, please note that links to other posts should be sent as comments and not as an answer. – Waldi Feb 21 '22 at 15:20