How to insert missing dates/times using R based on criteria?

Question

A data frame like below. 3 staffs have hourly readings in days, but incomplete (every staff shall have 24 readings a day).

Understand that staffs had different number of readings on the days. Now only interested in the staff with most readings in the day.

There are many days. It’s wanted to insert the missing (hourly) rows for the most ones on the days. That is, 2018-03-02 to insert only for Jack’s, 2018-03-03 only for David and 2018-03-04 only for Kate.

I tried these lines from this question (even though they fill all without differentiation) but not getting there.

How can it be done in R?

date_time <- c("2/3/2018 0:00","2/3/2018 1:00","2/3/2018 2:00","2/3/2018 3:00","2/3/2018 5:00","2/3/2018 6:00","2/3/2018 7:00","2/3/2018 8:00","2/3/2018 9:00","2/3/2018 10:00","2/3/2018 11:00","2/3/2018 12:00","2/3/2018 13:00","2/3/2018 14:00","2/3/2018 16:00","2/3/2018 17:00","2/3/2018 18:00","2/3/2018 19:00","2/3/2018 21:00","2/3/2018 22:00","2/3/2018 23:00","3/3/2018 0:00","3/3/2018 0:00","3/3/2018 1:00","3/3/2018 2:00","3/3/2018 4:00","3/3/2018 5:00","3/3/2018 7:00","3/3/2018 8:00","3/3/2018 9:00","3/3/2018 11:00","3/3/2018 12:00","3/3/2018 14:00","3/3/2018 15:00","3/3/2018 17:00","3/3/2018 18:00","3/3/2018 20:00","3/3/2018 22:00","3/3/2018 23:00","4/3/2018 0:00","4/3/2018 0:00","4/3/2018 1:00","4/3/2018 2:00","4/3/2018 3:00","4/3/2018 5:00","4/3/2018 6:00","4/3/2018 7:00","4/3/2018 8:00","4/3/2018 10:00","4/3/2018 11:00","4/3/2018 12:00","4/3/2018 14:00","4/3/2018 15:00","4/3/2018 16:00","4/3/2018 17:00","4/3/2018 19:00","4/3/2018 20:00","4/3/2018 22:00","4/3/2018 23:00")
staff <- c("Jack","Jack","Kate","Jack","Jack","Jack","Jack","Jack","Jack","Jack","Jack","Jack","Kate","Jack","Jack","Jack","David","David","Jack","Kate","David","David","David","David","David","David","David","David","David","David","David","David","David","David","David","David","David","Jack","Kate","David","David","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Jack")
reading <- c(7.5,8.3,7,6.9,7.1,8.1,8.4,8.8,6,7.1,8.9,7.3,7.4,6.9,11.3,18.8,4.6,6.7,7.7,7.8,7,7,6.6,6.8,6.7,6.1,7.1,6.3,7.2,6,5.8,6.6,6.5,6.4,7.2,8.4,6.5,6.5,5.5,6.7,7,7.5,6.5,7.5,7.2,6.3,7.3,8,7,8.2,6.5,6.8,7.5,7,6.1,5.7,6.7,4.3,6.3)
df <- data.frame(date_time, staff, reading)

Did you meant `library(dplyr);library(tidyr);df %>% mutate(date_time = dmy_hm(date_time)) %>% complete(date_time = seq(from = min(date_time), to = max(date_time), by = "1 hour"), staff, fill = list(reading = 0))` — akrun, Mar 05 '18 at 09:28
@akrun, good morning sir. thank you for your comment. it gives me: Evaluation error: could not find function "dmy_hm"... — Mark K, Mar 05 '18 at 09:34
@akrun. a nice one but it fills for everybody for the 3 days. I want 2018-03-02 to insert only for Jack’s, 2018-03-03 only for David and 2018-03-04 only for Kate. — Mark K, Mar 05 '18 at 09:39
Looks like `df %>% mutate(date_time = dmy_hm(date_time), date = as.Date(date_time)) %>% complete(date_time = seq(as.POSIXct(paste(date[1], "00:00:00"), tz = "GMT"), length.out = 24, by = "1 hour"), staff, fill = list(reading = 0)) %>% select(-date)` — akrun, Mar 05 '18 at 09:41
Do you need `df %>% mutate(date_time = dmy_hm(date_time), date = as.Date(date_time)) %>% split(.$date, .$staff, drop = TRUE) %>% map_df(~ .x %>% group_by(staff) %>% complete(date_time = seq(as.POSIXct(paste(date[1], "00:00:00"), tz = "GMT"), length.out = 24, by = "1 hour"), fill = list(reading = 0))) %>% select(-date)` — akrun, Mar 05 '18 at 10:05
@akrun, thanks again. please see the question edited with wanted result. (your latest code fill everybody for all days) — Mark K, Mar 05 '18 at 12:04
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/166239/discussion-between-mark-k-and-akrun). — Mark K, Mar 05 '18 at 12:06

Terru_theTerror · Answer 1 · 2018-03-05T09:55:34.140

1

Try this code:

Identify each daily hour and all staff members

date_h<-seq(as.POSIXlt(min(date_time),format="%d/%m/%Y %H:%M"),as.POSIXlt(max(date_time),format="%d/%m/%Y %H:%M"),by=60*60)
staff_u<-unique(staff)
comb<-expand.grid(staff_u,date_h)
colnames(comb)<-c("staff","date_time")

Uniform date format in df

df$date_time<-as.POSIXlt(df$date_time,format="%d/%m/%Y %H:%M")

Merge information

out<-merge(comb,df,all.x=T)

Your output:

head(out)
  staff           date_time reading
1  Jack 2018-03-02 00:00:00     7.5
2  Jack 2018-03-02 01:00:00     8.3
3  Jack 2018-03-02 02:00:00      NA
4  Jack 2018-03-02 03:00:00     6.9
5  Jack 2018-03-02 04:00:00      NA
6  Jack 2018-03-02 05:00:00     7.1

edited Mar 05 '18 at 09:55

answered Mar 05 '18 at 09:44

Terru_theTerror

4,918
2
20
39

,thank you. but it fills for everybody for 2018-03-02 and 2018-03-03. 2018-03-04 shows uneven numbers for everybody... – Mark K Mar 05 '18 at 09:51
Ok, try all.x=T – Terru_theTerror Mar 05 '18 at 09:56
Trouble could be related to this kind of repeated data: 2018-03-04 00:00:00 David 6.7 - 2018-03-04 00:00:00 David 7.0; same data same staff member – Terru_theTerror Mar 05 '18 at 09:57

akrun · Accepted Answer · 2018-03-05T12:50:07.863

The option would be to do this separately. Create a data.table of the dates of interest and the corresponding 'staff', and get the full sequence of date time, then we rbind this with the original dataset and using a condition, we summarise the data

library(data.table)
stf <- c("Jack", "David", "Kate")
date <- as.Date(c("2018-03-02", "2018-03-03", "2018-03-04"))
df1 <- data.table(date, staff= stf)[, .(date_time = seq(as.POSIXct(paste(date, "00:00:00"), 
       tz = "GMT"),
           length.out = 24, by = "1 hour")), staff]

setDT(df)[, date_time := as.POSIXct(date_time, "%d/%m/%Y %H:%M", tz = "GMT")]
res <- rbindlist(list(df, df1), fill = TRUE)[, 
     .(reading = if(any(is.na(reading))) sum(reading, na.rm = TRUE) else reading),
         .(staff, date_time)]

table(res$staff, as.Date(res$date_time))

#         2018-03-02 2018-03-03 2018-03-04
#  David          3         24          2
#  Jack          24          1          1
#  Kate           3          1         24

head(res)
#   staff           date_time reading
#1:  Jack 2018-03-02 00:00:00     7.5
#2:  Jack 2018-03-02 01:00:00     8.3
#3:  Kate 2018-03-02 02:00:00     7.0
#4:  Jack 2018-03-02 03:00:00     6.9
#5:  Jack 2018-03-02 05:00:00     7.1
#6:  Jack 2018-03-02 06:00:00     8.1

tail(res)
#   staff           date_time reading
#1:  Kate 2018-03-04 04:00:00       0
#2:  Kate 2018-03-04 09:00:00       0
#3:  Kate 2018-03-04 13:00:00       0
#4:  Kate 2018-03-04 18:00:00       0
#5:  Kate 2018-03-04 21:00:00       0
#6:  Kate 2018-03-04 23:00:00       0

thanks. but the date_time appended is value like 1520200800, instead of regular date/time format. could you please have a look? — Mark K, Mar 05 '18 at 12:47
@MarkK I think I know why it was doing that for you bcz I forgot to paste the date_time conversion line in my code. Could you please check it now? — akrun, Mar 05 '18 at 12:53
by the way, when the original data is big, there's no a way to manually create "stf" and "date". is there a way to generate "stf" and "date" smartly? — Mark K, Mar 05 '18 at 13:01
@MarkK Yes, you are get it done by `unique(df$staff)` and the corresponding dates it seems to be custom made — akrun, Mar 05 '18 at 13:04

How to insert missing dates/times using R based on criteria?

2 Answers2