5

I am attempting to fill some values in a data set. A simplified version of my data can be found below:

    > example_df
           Date    GROUP   value
157     2018-01-31 10180  3.464
158     2018-02-28 10180  3.413
159     2018-03-31 10180  3.418
160     2018-04-30 10180     NA
161     2018-05-31 10180     NA
162     2018-06-30 10180     NA
163     2018-07-31 10180     NA
164     2018-08-31 10180     NA
165     2018-09-30 10180     NA
166     2018-10-31 10180     NA
167     2018-11-30 10180     NA
168     2018-12-31 10180     NA
169     2019-01-31 10180     NA
170     2019-02-28 10180     NA
171     2019-03-31 10180     NA
172     2019-04-30 10180     NA
173     2019-05-31 10180     NA
174     2019-06-30 10180     NA
175     2019-07-31 10180     NA
176     2019-08-31 10180     NA
177     2019-09-30 10180     NA
178     2019-10-31 10180     NA
179     2019-11-30 10180     NA
373     2018-01-31 10420  5.085
374     2018-02-28 10420  5.051
375     2018-03-31 10420  4.993
376     2018-04-30 10420     NA
377     2018-05-31 10420     NA
378     2018-06-30 10420     NA
379     2018-07-31 10420     NA
380     2018-08-31 10420     NA
381     2018-09-30 10420     NA
382     2018-10-31 10420     NA
383     2018-11-30 10420     NA
384     2018-12-31 10420     NA
385     2019-01-31 10420     NA
386     2019-02-28 10420     NA
387     2019-03-31 10420     NA
388     2019-04-30 10420     NA
389     2019-05-31 10420     NA
390     2019-06-30 10420     NA
391     2019-07-31 10420     NA
392     2019-08-31 10420     NA
393     2019-09-30 10420     NA
394     2019-10-31 10420     NA
395     2019-11-30 10420     NA
589     2018-01-31 10500  5.796
590     2018-02-28 10500  5.860
591     2018-03-31 10500  5.913
592     2018-04-30 10500     NA
593     2018-05-31 10500     NA
594     2018-06-30 10500     NA
595     2018-07-31 10500     NA
596     2018-08-31 10500     NA
597     2018-09-30 10500     NA
598     2018-10-31 10500     NA
599     2018-11-30 10500     NA
600     2018-12-31 10500     NA
601     2019-01-31 10500     NA
602     2019-02-28 10500     NA
603     2019-03-31 10500     NA
604     2019-04-30 10500     NA
605     2019-05-31 10500     NA
606     2019-06-30 10500     NA
607     2019-07-31 10500     NA
608     2019-08-31 10500     NA
609     2019-09-30 10500     NA
610     2019-10-31 10500     NA
611     2019-11-30 10500     NA

As you can see. For each group I have values up to today's month, and then a set of NAs until the start of the next group. What I would like to do would be, for each group, fill these NAs with a sequence which originates from the last non-NULL value and increases by a fixed value (I have chosen 0.065) until the final date of the group. I would prefer a dplyr solution, but any information on how to achieve this would be very helpful. Thank you.

Justin Klevs
  • 651
  • 6
  • 17
  • 1
    https://stackoverflow.com/questions/40040834/replace-na-with-previous-or-next-value-by-group-using-dplyr – BENY Mar 29 '18 at 18:12

3 Answers3

2

You could do something like this (inspired by Frank and eddi's comments/solutions):

df$value2 <- ave(df$value, df$GROUP, cumsum(!is.na(df$value)),
                 FUN = function(x) x[1] + 0.065 * (1:length(x) - 1))

Or my original ave:

df$value2 <- ave(df$value, df$GROUP, 
                 FUN = function(x) {nas_to_replace <- is.na(x) & seq_along(x) > tail(which(!is.na(x)),1)
                                    replace(x,  nas_to_replace, 
                                                tail(x[!is.na(x)],1) + 0.065*(1:sum(nas_to_replace)))} )

This function is designed to replace only the NAs that come after the last non-NA. So if you have a vector like c(NA, 1, 2, NA, NA) it would replace only the last two elements.

head(df)
#      Date GROUP value value2
#1 2018-01-31 10180 3.464  3.464
#2 2018-02-28 10180 3.413  3.413
#3 2018-03-31 10180 3.418  3.418
#4 2018-04-30 10180    NA  3.483
#5 2018-05-31 10180    NA  3.548
#6 2018-06-30 10180    NA  3.613
Mike H.
  • 13,960
  • 2
  • 29
  • 39
2
library(data.table)
dt = as.data.table(yourdf) # or convert in place using setDT

dt[, value := value[1] + 0.065 * (1:.N - 1)
   , by = .(GROUP, cumsum(!is.na(value)))]
eddi
  • 49,088
  • 6
  • 104
  • 155
  • Hi @eddi Thanks for the assistance, this is perfect. Do you have any suggestions for doing this same sort of operation, but instead of increasing by a fixed value, have the value increase by "x" percentage over n months? For instance, a 10% increase over 18 months. – Justin Klevs Mar 29 '18 at 18:40
  • 2
    Or `seq`, something like `DF[, v := seq(first(value), by=0.065, length.out=.N), by=.(GROUP, rleid(GROUP, cumsum(!is.na(value))))]` – Frank Mar 29 '18 at 18:41
  • @JustinKlevs it's a little unclear to me what you mean - maybe use `seq` as Frank suggested and instead of `by`, set the end value? – eddi Mar 29 '18 at 18:52
  • Hi @eddi. Thanks for the suggestion. In reference to your previous suggestion for percentage increases. I had something slightly different in mind. Instead of increasing by 10% for each observation in the sequence, to increase by a total of 10% over a set amount of months or the total period for which NAs exist. All this info has already been very valuable, thank you – Justin Klevs Mar 29 '18 at 18:53
  • 1
    @JustinKlevs `seq(first(value), first(value)*1.1, length.out=.N)` or `n()` instead of `.N` if you're in dplyr instead of data.table. You could read `?seq` for many variations. – Frank Mar 29 '18 at 18:55
1

Not as elegant as the data.table solution, but using dplyr and building on this answer, you could do something like:

library(dplyr)
df %>% 
  group_by(GROUP, tmp=cumsum(!is.na(value))) %>%
  mutate(value=value[1] + 0.065*(0:(length(value)-1))) %>% 
  ungroup() %>% 
  select(-tmp)
sbha
  • 9,802
  • 2
  • 74
  • 62