1

I have a big data.frame of time-series (>5*10^6 rows) (df_ts) and a smaller data.frame with manual adjustments for the time-series (df_adj):

# Example data
sq1 <- seq.Date(as.Date("2017-01-01"), as.Date("2017-01-15"),by = "days")

df_ts <- data.frame(Station = rep(c("A", "B"), each = length(sq1)),
                  Date = rep(sq1, 2),
                  Value = rnorm(n = 2* length(sq1)),
                  ValueAdj = 0)

df_adj <- data.frame(Station = c("A", "A", "B"),
                  Start = as.Date(c("2017-01-02", "2017-01-10", "2017-01-05")),
                  End = as.Date(c("2017-01-02", "2017-01-13", "2017-01-05")),
                  Adj = c(NA, 0.5, -0.6))

Time-steps are not necessary equal as in the example data

I want to change df_ts$ValueAdj to values from df_adj$Adj for periods defined by df_adj$Start and df_adj$End. Dates are inclusive and values in df_adj$Adj are numeric or NA.

Below is my solution with a for loop:

for(i in 1:nrow(df_adj)){
  df_ts[df_ts$Station == df_adj$Station[i] & 
        df_ts$Date >= df_adj$Start[i] & 
        df_ts$Date <= df_adj$End[i], "ValueAdj"] <- 
    df_adj$Adj[i]
}

Returns:

   Station       Date      Value ValueAdj
1        A 2017-01-01 -1.9049554      0.0
2        A 2017-01-02 -0.8111702       NA
3        A 2017-01-03  1.3240043      0.0
4        A 2017-01-04  0.6156368      0.0
5        A 2017-01-05  1.0916690      0.0
6        A 2017-01-06  0.3066049      0.0
7        A 2017-01-07 -0.1101588      0.0
8        A 2017-01-08 -0.9243128      0.0
9        A 2017-01-09  1.5929138      0.0
10       A 2017-01-10  0.0450106      0.5
11       A 2017-01-11 -0.7151284      0.5
12       A 2017-01-12  0.8652231      0.5
13       A 2017-01-13  1.0744410      0.5
14       A 2017-01-14  1.8956548      0.0
15       A 2017-01-15 -0.6029973      0.0
16       B 2017-01-01 -0.3908678      0.0
17       B 2017-01-02 -0.4162220      0.0
18       B 2017-01-03 -0.3756574      0.0
19       B 2017-01-04 -0.3666309      0.0
20       B 2017-01-05 -0.2956775     -0.6
21       B 2017-01-06  1.4418204      0.0
22       B 2017-01-07 -0.6975383      0.0
23       B 2017-01-08 -0.3881675      0.0
24       B 2017-01-09  0.6525365      0.0
25       B 2017-01-10  1.1247724      0.0
26       B 2017-01-11 -0.7721108      0.0
27       B 2017-01-12 -0.5080862      0.0
28       B 2017-01-13  0.5236206      0.0
29       B 2017-01-14  1.0177542      0.0
30       B 2017-01-15 -0.2511646      0.0

I am looking for a neater (and faster) solution preferably with dplyr/tidyr. The solution with join %>% filter %>% select like in the here doesn't work in my case because replacement value could also be NA. I am also not sure how would that perform on a big data.frame.

pavlopavlin
  • 120
  • 7

1 Answers1

0

You can perform a fuzzy-join :

library(fuzzyjoin)
library(dplyr)

df_ts %>%
  fuzzy_left_join(df_adj, 
                  by = c('Station', 'Date' = 'Start', 'Date' = 'End'), 
                  match_fun = c(`==`, `>=`, `<=`)) %>%
  transmute(Station = Station.x, 
            Date, Value,
            ValueAdj = ifelse(is.na(Start), ValueAdj,Adj))

#   Station       Date       Value ValueAdj
#1        A 2017-01-01  0.05975332      0.0
#2        A 2017-01-02  0.25582488       NA
#3        A 2017-01-03  1.07000418      0.0
#4        A 2017-01-04  0.98506119      0.0
#5        A 2017-01-05  0.31073550      0.0
#6        A 2017-01-06  0.46743876      0.0
#7        A 2017-01-07  0.97641285      0.0
#8        A 2017-01-08 -1.55653207      0.0
#9        A 2017-01-09  1.31032057      0.0
#10       A 2017-01-10 -0.92685638      0.5
#11       A 2017-01-11  0.22341808      0.5
#12       A 2017-01-12  1.74218972      0.5
#13       A 2017-01-13  0.23301090      0.5
#14       A 2017-01-14 -0.67630001      0.0
#15       A 2017-01-15 -0.41174254      0.0
#16       B 2017-01-01 -0.05273856      0.0
#17       B 2017-01-02 -1.01259196      0.0
#18       B 2017-01-03 -0.02470174      0.0
#19       B 2017-01-04 -0.08177288      0.0
#20       B 2017-01-05 -0.21861024     -0.6
#21       B 2017-01-06  0.38780606      0.0
#22       B 2017-01-07 -0.07041152      0.0
#23       B 2017-01-08 -0.47037903      0.0
#24       B 2017-01-09 -0.47601009      0.0
#25       B 2017-01-10 -0.55555959      0.0
#26       B 2017-01-11 -0.55813517      0.0
#27       B 2017-01-12 -0.51827499      0.0
#28       B 2017-01-13 -0.68259537      0.0
#29       B 2017-01-14 -0.89242106      0.0
#30       B 2017-01-15  0.26492938      0.0
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Tried your solution on a subset of my data and after 5 min get: Error: cannot allocate vector of size 1.1 Gb. For comparison, on the same subset my for loop solution takes 12s. – pavlopavlin Mar 08 '21 at 11:07