
I have a data frame that looks like this:

    df = data.frame(id = c(1,1,1,2,2,2,3,3,3),
                    date = rep(c("1990-01","1990-02","1990-03"), 3),
                    pd = c(0.005,0.004,0.003,0.001,0.0005,0.002,0.008,0.0065,0.002))

    df
    #  id    date     pd
    #  1 1990-01 0.0050
    #  1 1990-02 0.0040
    #  1 1990-03 0.0030
    #  2 1990-01 0.0010
    #  2 1990-02 0.0005
    #  2 1990-03 0.0020
    #  3 1990-01 0.0080
    #  3 1990-02 0.0065
    #  3 1990-03 0.0020

The id refers to different companies. I'd like to calculate the 'distance to default', `-qnorm(pd_t) - (-qnorm(pd_t-1))`, for each id, ordered by date.
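
Spelled out (this just restates the formula above; Φ⁻¹ is the standard normal quantile function, i.e. `qnorm`), the quantity per company i and month t is

    DD_{i,t} = -\Phi^{-1}(pd_{i,t}) - \bigl(-\Phi^{-1}(pd_{i,t-1})\bigr)
             = \Phi^{-1}(pd_{i,t-1}) - \Phi^{-1}(pd_{i,t})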

My code produces the output I am looking for, but it takes a very long time because of the size of the real data frame:

    library(dplyr)

    id_vec = 1:3
    df$DD = NA
    for (i in 1:3) {
      df[df$id == id_vec[i], ] = df[df$id == id_vec[i], ] %>%
        mutate(DD = -qnorm(pd) - lag(-qnorm(pd)))
    }
    #  id    date     pd          DD
    #  1 1990-01 0.0050          NA
    #  1 1990-02 0.0040  0.07624050
    #  1 1990-03 0.0030  0.09571158
    #  2 1990-01 0.0010          NA
    #  2 1990-02 0.0005  0.20029443
    #  2 1990-03 0.0020 -0.41236499
    #  3 1990-01 0.0080          NA
    #  3 1990-02 0.0065  0.07485375
    #  3 1990-03 0.0020  0.39439245

Does anyone have an idea how I can improve the performance?

lukp

1 Answer


data.table will do the trick, running about 10x faster than the loop:

    library(data.table)

    setDT(df)[, DD := -qnorm(pd) - -qnorm(shift(pd, n = 1, fill = NA, type = "lag")), by = "id"]

    #    id    date     pd          DD
    # 1:  1 1990-01 0.0050          NA
    # 2:  1 1990-02 0.0040  0.07624050
    # 3:  1 1990-03 0.0030  0.09571158
    # 4:  2 1990-01 0.0010          NA
    # 5:  2 1990-02 0.0005  0.20029443
    # 6:  2 1990-03 0.0020 -0.41236499
    # 7:  3 1990-01 0.0080          NA
    # 8:  3 1990-02 0.0065  0.07485375
    # 9:  3 1990-03 0.0020  0.39439245
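
For reference, the same grouped calculation written directly in dplyr (this is the `dplyr` entry in the benchmark below) also avoids the per-id loop and is close to data.table in speed on this small example:

    library(dplyr)

    df %>%
      group_by(id) %>%
      mutate(DD = -qnorm(pd) - lag(-qnorm(pd))) %>%
      ungroup()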

Benchmarks:

    library(microbenchmark)

    microbenchmark(
      loop = {
        id_vec = 1:3
        df$DD = NA
        for (i in 1:3) {
          df[df$id == id_vec[i], ] = df[df$id == id_vec[i], ] %>%
            mutate(DD = -qnorm(pd) - lag(-qnorm(pd)))
        }
      },
      dplyr = {df %>% group_by(id) %>% mutate(DD = -qnorm(pd) - lag(-qnorm(pd)))},
      data.table = {setDT(df)[, DD := -qnorm(pd) - -qnorm(shift(pd, n = 1, fill = NA, type = "lag")), by = "id"]},
      times = 10
    )

    # Unit: microseconds
    #       expr       min        lq      mean    median        uq       max neval
    #       loop 10718.040 10781.305 11500.511 11583.267 11739.021 13180.565    10
    #      dplyr  1146.005  1229.456  1341.345  1272.838  1333.994  1893.740    10
    # data.table   888.426   966.152  1074.094   986.336  1052.915  1763.293    10
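
The timings above are on the 9-row example, where everything is fast. As a rough sketch (the 10,000 companies x 100 months below is an assumed size, not from the question), you can check how the data.table approach behaves on something closer to the real data:

    # hypothetical larger test set: 10,000 companies x 100 months (assumed sizes)
    library(data.table)

    n_id    <- 1e4
    n_month <- 100
    months  <- format(seq(as.Date("1990-01-01"), by = "month", length.out = n_month), "%Y-%m")

    big <- data.table(id   = rep(seq_len(n_id), each = n_month),
                      date = rep(months, times = n_id),
                      pd   = runif(n_id * n_month, 0.0001, 0.01))

    # same grouped computation as above, timed on the larger table
    system.time(
      big[, DD := -qnorm(pd) - -qnorm(shift(pd, n = 1, fill = NA, type = "lag")), by = "id"]
    )
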
Wimpel