1

A data frame like below. 3 staffs have hourly readings in days, but incomplete (every staff shall have 24 readings a day).

Understand that staffs had different number of readings on the days. Now only interested in the staff with most readings in the day.

enter image description here

There are many days. It’s wanted to insert the missing (hourly) rows for the most ones on the days. That is, 2018-03-02 to insert only for Jack’s, 2018-03-03 only for David and 2018-03-04 only for Kate.

I tried these lines from this question (even though they fill all without differentiation) but not getting there.

How can it be done in R?

date_time <- c("2/3/2018 0:00","2/3/2018 1:00","2/3/2018 2:00","2/3/2018 3:00","2/3/2018 5:00","2/3/2018 6:00","2/3/2018 7:00","2/3/2018 8:00","2/3/2018 9:00","2/3/2018 10:00","2/3/2018 11:00","2/3/2018 12:00","2/3/2018 13:00","2/3/2018 14:00","2/3/2018 16:00","2/3/2018 17:00","2/3/2018 18:00","2/3/2018 19:00","2/3/2018 21:00","2/3/2018 22:00","2/3/2018 23:00","3/3/2018 0:00","3/3/2018 0:00","3/3/2018 1:00","3/3/2018 2:00","3/3/2018 4:00","3/3/2018 5:00","3/3/2018 7:00","3/3/2018 8:00","3/3/2018 9:00","3/3/2018 11:00","3/3/2018 12:00","3/3/2018 14:00","3/3/2018 15:00","3/3/2018 17:00","3/3/2018 18:00","3/3/2018 20:00","3/3/2018 22:00","3/3/2018 23:00","4/3/2018 0:00","4/3/2018 0:00","4/3/2018 1:00","4/3/2018 2:00","4/3/2018 3:00","4/3/2018 5:00","4/3/2018 6:00","4/3/2018 7:00","4/3/2018 8:00","4/3/2018 10:00","4/3/2018 11:00","4/3/2018 12:00","4/3/2018 14:00","4/3/2018 15:00","4/3/2018 16:00","4/3/2018 17:00","4/3/2018 19:00","4/3/2018 20:00","4/3/2018 22:00","4/3/2018 23:00")
staff <- c("Jack","Jack","Kate","Jack","Jack","Jack","Jack","Jack","Jack","Jack","Jack","Jack","Kate","Jack","Jack","Jack","David","David","Jack","Kate","David","David","David","David","David","David","David","David","David","David","David","David","David","David","David","David","David","Jack","Kate","David","David","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Kate","Jack")
reading <- c(7.5,8.3,7,6.9,7.1,8.1,8.4,8.8,6,7.1,8.9,7.3,7.4,6.9,11.3,18.8,4.6,6.7,7.7,7.8,7,7,6.6,6.8,6.7,6.1,7.1,6.3,7.2,6,5.8,6.6,6.5,6.4,7.2,8.4,6.5,6.5,5.5,6.7,7,7.5,6.5,7.5,7.2,6.3,7.3,8,7,8.2,6.5,6.8,7.5,7,6.1,5.7,6.7,4.3,6.3)
df <- data.frame(date_time, staff, reading)
halfer
  • 19,824
  • 17
  • 99
  • 186
Mark K
  • 8,767
  • 14
  • 58
  • 118
  • 2
    Did you meant `library(dplyr);library(tidyr);df %>% mutate(date_time = dmy_hm(date_time)) %>% complete(date_time = seq(from = min(date_time), to = max(date_time), by = "1 hour"), staff, fill = list(reading = 0))` – akrun Mar 05 '18 at 09:28
  • @akrun, good morning sir. thank you for your comment. it gives me: Evaluation error: could not find function "dmy_hm"... – Mark K Mar 05 '18 at 09:34
  • 1
    It is from `lubridate`. I forgot – akrun Mar 05 '18 at 09:35
  • @akrun. a nice one but it fills for everybody for the 3 days. I want 2018-03-02 to insert only for Jack’s, 2018-03-03 only for David and 2018-03-04 only for Kate. – Mark K Mar 05 '18 at 09:39
  • 1
    Looks like `df %>% mutate(date_time = dmy_hm(date_time), date = as.Date(date_time)) %>% complete(date_time = seq(as.POSIXct(paste(date[1], "00:00:00"), tz = "GMT"), length.out = 24, by = "1 hour"), staff, fill = list(reading = 0)) %>% select(-date)` – akrun Mar 05 '18 at 09:41
  • @akrun, it does for everybody only for 2018-03-02... – Mark K Mar 05 '18 at 09:46
  • Can you show the expected frequency – akrun Mar 05 '18 at 09:53
  • Do you need `df %>% mutate(date_time = dmy_hm(date_time), date = as.Date(date_time)) %>% split(.$date, .$staff, drop = TRUE) %>% map_df(~ .x %>% group_by(staff) %>% complete(date_time = seq(as.POSIXct(paste(date[1], "00:00:00"), tz = "GMT"), length.out = 24, by = "1 hour"), fill = list(reading = 0))) %>% select(-date)` – akrun Mar 05 '18 at 10:05
  • @akrun, thanks again. please see the question edited with wanted result. (your latest code fill everybody for all days) – Mark K Mar 05 '18 at 12:04
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/166239/discussion-between-mark-k-and-akrun). – Mark K Mar 05 '18 at 12:06
  • Thanks for clearing the doubts, i posted a soution below – akrun Mar 05 '18 at 12:36

2 Answers2

1

Try this code:

Identify each daily hour and all staff members

date_h<-seq(as.POSIXlt(min(date_time),format="%d/%m/%Y %H:%M"),as.POSIXlt(max(date_time),format="%d/%m/%Y %H:%M"),by=60*60)
staff_u<-unique(staff)
comb<-expand.grid(staff_u,date_h)
colnames(comb)<-c("staff","date_time")

Uniform date format in df

df$date_time<-as.POSIXlt(df$date_time,format="%d/%m/%Y %H:%M")

Merge information

out<-merge(comb,df,all.x=T)

Your output:

head(out)
  staff           date_time reading
1  Jack 2018-03-02 00:00:00     7.5
2  Jack 2018-03-02 01:00:00     8.3
3  Jack 2018-03-02 02:00:00      NA
4  Jack 2018-03-02 03:00:00     6.9
5  Jack 2018-03-02 04:00:00      NA
6  Jack 2018-03-02 05:00:00     7.1
Terru_theTerror
  • 4,918
  • 2
  • 20
  • 39
1

The option would be to do this separately. Create a data.table of the dates of interest and the corresponding 'staff', and get the full sequence of date time, then we rbind this with the original dataset and using a condition, we summarise the data

library(data.table)
stf <- c("Jack", "David", "Kate")
date <- as.Date(c("2018-03-02", "2018-03-03", "2018-03-04"))
df1 <- data.table(date, staff= stf)[, .(date_time = seq(as.POSIXct(paste(date, "00:00:00"), 
       tz = "GMT"),
           length.out = 24, by = "1 hour")), staff]

setDT(df)[, date_time := as.POSIXct(date_time, "%d/%m/%Y %H:%M", tz = "GMT")]
res <- rbindlist(list(df, df1), fill = TRUE)[, 
     .(reading = if(any(is.na(reading))) sum(reading, na.rm = TRUE) else reading),
         .(staff, date_time)]

table(res$staff, as.Date(res$date_time))

#         2018-03-02 2018-03-03 2018-03-04
#  David          3         24          2
#  Jack          24          1          1
#  Kate           3          1         24

head(res)
#   staff           date_time reading
#1:  Jack 2018-03-02 00:00:00     7.5
#2:  Jack 2018-03-02 01:00:00     8.3
#3:  Kate 2018-03-02 02:00:00     7.0
#4:  Jack 2018-03-02 03:00:00     6.9
#5:  Jack 2018-03-02 05:00:00     7.1
#6:  Jack 2018-03-02 06:00:00     8.1

tail(res)
#   staff           date_time reading
#1:  Kate 2018-03-04 04:00:00       0
#2:  Kate 2018-03-04 09:00:00       0
#3:  Kate 2018-03-04 13:00:00       0
#4:  Kate 2018-03-04 18:00:00       0
#5:  Kate 2018-03-04 21:00:00       0
#6:  Kate 2018-03-04 23:00:00       0
akrun
  • 874,273
  • 37
  • 540
  • 662
  • thanks. but the date_time appended is value like 1520200800, instead of regular date/time format. could you please have a look? – Mark K Mar 05 '18 at 12:47
  • 1
    @MarkK I think I know why it was doing that for you bcz I forgot to paste the date_time conversion line in my code. Could you please check it now? – akrun Mar 05 '18 at 12:53
  • it's ok now! you are indeed a guru! – Mark K Mar 05 '18 at 13:01
  • by the way, when the original data is big, there's no a way to manually create "stf" and "date". is there a way to generate "stf" and "date" smartly? – Mark K Mar 05 '18 at 13:01
  • 1
    @MarkK Yes, you are get it done by `unique(df$staff)` and the corresponding dates it seems to be custom made – akrun Mar 05 '18 at 13:04
  • 1
    gurus like you make the world beautiful! – Mark K Mar 05 '18 at 13:08