1

I have the following data frame:

      Date.POSIXct       Date      WeekDay DayCategory Hour Holidays   value
1 2018-05-01 00:00:00 2018-05-01      MA    MA-MI-JU    0        0      30
2 2018-05-01 01:00:00 2018-05-01      MA    MA-MI-JU    1        0      80
3 2018-05-01 02:00:00 2018-05-01      MA    MA-MI-JU    2        0      42
4 2018-05-01 03:00:00 2018-05-01      MA    MA-MI-JU    3        0      90
5 2018-05-01 04:00:00 2018-05-01      MA    MA-MI-JU    4        0      95
6 2018-05-01 05:00:00 2018-05-01      MA    MA-MI-JU    5        0       5

DayCategory groups days of the week in the following way: Mondays goes to LU DayCategory. Tuesday, Wednesday and Thursdays go to MA-MI-JU DayCategory. Friday goes to VI, Saturdays to SA and Sundays to DO Categories respectively.

I would like to find the value for the same hour in the previous day (Date) with the same DayCategory, while Holidays remains within the same group (e.g. if one instance has holiday 0 but previous day with same DayCategory has 1, we should lookv for the previous one, etc.)

As an intermediate step and to understand the process I would like to add a column PreviousDaySameDayCategory with the Date of the previous day that has the same DayCategory that the corresponding row. Some times it will be just the same date minus seven days ("LU","VI","SA","DO") but other days it will be just one day.

Reproducible data:

library(lubridate)
Date.POSIXct <- seq(as.POSIXct("2018-05-01"), as.POSIXct("2018-05-31"), "hour")
mydf <- as.data.frame(Date.POSIXct)
mydf$Date <- as.Date(substr(as.character(mydf$Date.POSIXct),1,10))
mydf$WeekDay <- substr(toupper((weekdays(mydf$Date))),1,2)
mydf$DayCategory <-as.factor(ifelse(mydf$WeekDay == "MA" | mydf$WeekDay == "MI" | mydf$WeekDay == "JU", 
                                    "MA-MI-JU", mydf$WeekDay))
mydf$Hour <- hour(mydf$Date.POSIXct)
mydf$Holidays <- c(rep(0, 24*7),rep(1, 24*7), rep(0, 24*16+1))
set.seed(123)
mydf$myvalue <- sample.int(101,size=nrow(mydf),replace=TRUE)

I have manually started the first days and craeted a vector of how the solution should look like:

  a <- rep(NA, 24)
  b <- mydf$value[1:24]
  c <- mydf$value[25:48]
  d <- rep(NA, 24)
  e <- rep(NA,24)
  f <- rep(NA,24)
  g <- rep(NA,24)
  h <- rep(NA,24)
  i <- mydf$value[169:192]
  solution <- c(a,b,c,d,e,f,g,h,i)
  solution

I would appreciate any hint in the thinking process to solve this kind of problems that I face with relative frequency.

alvaropr
  • 699
  • 9
  • 20

2 Answers2

2

OP has posted the same question in the Data Science section as well. I am including the same solution I have there here case it might help others. It is similar to Uwe's solution but with the dplyr library instead.

library(dplyr)
rankedDf <- mydf %>%
  group_by(DayCategory, Hour, Holidays) %>%
  arrange(Date) %>%
  mutate(rowRank = order(Date), previousRowRank = order(Date) - 1) %>%
  left_join(., ., by = c("previousRowRank" = "rowRank", "DayCategory", "Hour", "Holidays")) %>%
  select(
  Date.POSIXct = Date.POSIXct.x,
  Date = Date.x,
  WeekDay = WeekDay.x,
  DayCategory,
  Hour,
  Holidays,
  myvalue = myvalue.x,
  PreviousDaySameDayCategory = Date.y,
  PreviousValueSameDayCategory = myvalue.y
)
print.data.frame(rankedDf)

P.S. love the way Uwe changes the original sample code.

The Lyrist
  • 434
  • 3
  • 13
  • Thanks the Lyrist for your answer. Maybe we should just copy your answer from Data Science Exchange Community and close the answer there, is that ok w you? I understand your reassoning there, however I have not been able to implement in in the code with satisfactory results. I will keep trying as I am more familiar with dplyr than data.table and would like to get it with both packages to practice. – alvaropr May 30 '18 at 08:58
1

Here is a data.table solution which uses a "grouped shift()" and multiple joins to copy value from the same hour of the PreviousDaySameDayCategory.

Create reproducible data

OP's code to create reproducible data was not fully reproducible because he used the weekdays() function which returns the weekday names in the current locale (which seems to be Spanish for the OP). To be independent of the current locale, I switched to format(Date, "%u") which returns the numbers 1 to 7 for Monday to Sunday. Furthermore, the fct_collapse() from the forcats package is used to collapse the days 2, 3, and 4 (Tuesday to Thursday) into one factor level.

library(data.table)
# note that package lubridate is not required
myDT <- data.table(Date.POSIXct = seq(as.POSIXct("2018-05-01"), 
                                      as.POSIXct("2018-05-31"), "hour"))
myDT[, Date := as.Date(Date.POSIXct)]
myDT[, Weekday := format(Date, "%u")]
myDT[, DayCategory := forcats::fct_collapse(Weekday, "234" = c("2", "3", "4"))]
myDT[, hour := hour(Date.POSIXct)]
myDT[, Holidays := c(rep(0, 24 * 7), rep(1, 24 * 7), rep(0, 24 * 16 + 1))]
set.seed(123)
myDT[, myvalue := sample.int(101, size = nrow(mydf), replace = TRUE)]

Intermediate step: PreviousDaySameDayCategory

The sample data set consists of hourly data but in order to determine the PreviousDaySameDayCategory we need to work day-wise and thus only have to deal with the unique values of Date, DayCategory, and Holidays. The data is grouped by DayCategory and the Holidays indicator. For each group separately, the previous day is picked by lagging Date. As the result of shift() operations depend on the order of rows the dataset has been ordered before shifting.

tmp <- unique(myDT[order(Date), .(Date, DayCategory, Holidays)])[
  , .(Date, PreviousDaySameDayCategory = shift(Date)), by = .(DayCategory, Holidays)][
    order(Date)]
tmp
    DayCategory Holidays       Date PreviousDaySameDayCategory
 1:         234        0 2018-05-01                       <NA>
 2:         234        0 2018-05-02                 2018-05-01
 3:         234        0 2018-05-03                 2018-05-02
 4:           5        0 2018-05-04                       <NA>
 5:           6        0 2018-05-05                       <NA>
 6:           7        0 2018-05-06                       <NA>
 7:           1        0 2018-05-07                       <NA>
 8:         234        1 2018-05-08                       <NA>
 9:         234        1 2018-05-09                 2018-05-08
10:         234        1 2018-05-10                 2018-05-09
11:           5        1 2018-05-11                       <NA>
12:           6        1 2018-05-12                       <NA>
13:           7        1 2018-05-13                       <NA>
14:           1        1 2018-05-14                       <NA>
15:         234        0 2018-05-15                 2018-05-03
16:         234        0 2018-05-16                 2018-05-15
17:         234        0 2018-05-17                 2018-05-16
18:           5        0 2018-05-18                 2018-05-04
19:           6        0 2018-05-19                 2018-05-05
20:           7        0 2018-05-20                 2018-05-06
21:           1        0 2018-05-21                 2018-05-07
22:         234        0 2018-05-22                 2018-05-17
23:         234        0 2018-05-23                 2018-05-22
24:         234        0 2018-05-24                 2018-05-23
25:           5        0 2018-05-25                 2018-05-18
26:           6        0 2018-05-26                 2018-05-19
27:           7        0 2018-05-27                 2018-05-20
28:           1        0 2018-05-28                 2018-05-21
29:         234        0 2018-05-29                 2018-05-24
30:         234        0 2018-05-30                 2018-05-29
31:         234        0 2018-05-31                 2018-05-30
    DayCategory Holidays       Date PreviousDaySameDayCategory

For days 3 and 4 (Wednesdays and Thursday) the preceeding Tuesday and Wednesday, resp., of the same week are picked. For day 2 (Tuesday) the preceeding Thursday of the preceeding week is picked if both weeks have the same holiday indicator set. If the preceeding week has a different holiday indicator the most recent Thursday of the same holiday period is picked. This is why, e.g., the 2018-05-03 is picked in row 15.

Copying value from matching PreviousDaySameDayCategory

This is done in two steps. First, the hourly values are picked from the matching PreviousDaySameDayCategory by joining with the matching days table tmp:

tmp2 <- myDT[tmp, on = .(Date = PreviousDaySameDayCategory), .(Date = i.Date, hour, myvalue), nomatch = 0L] 
tmp2
           Date hour myvalue
  1: 2018-05-02    0      30
  2: 2018-05-02    1      80
  3: 2018-05-02    2      42
  4: 2018-05-02    3      90
  5: 2018-05-02    4      95
 ---                        
500: 2018-05-31   19      39
501: 2018-05-31   20       1
502: 2018-05-31   21       1
503: 2018-05-31   22     101
504: 2018-05-31   23      11

Second, a new column previousValue in myDT is created by updating in a join which contains the corresponding value from PreviousDaySameDayCategory:

myDT[tmp2, on = .(Date, hour), previousValue := i.myvalue]

Here, the first two days of the result are shown:

myDT[Date %between% c(as.Date("2018-05-01"), as.Date("2018-05-02"))]
           Date.POSIXct       Date Weekday DayCategory hour Holidays myvalue previousValue
 1: 2018-05-01 00:00:00 2018-05-01       2         234    0        0      30            NA
 2: 2018-05-01 01:00:00 2018-05-01       2         234    1        0      80            NA
 3: 2018-05-01 02:00:00 2018-05-01       2         234    2        0      42            NA
 4: 2018-05-01 03:00:00 2018-05-01       2         234    3        0      90            NA
 5: 2018-05-01 04:00:00 2018-05-01       2         234    4        0      95            NA
 6: 2018-05-01 05:00:00 2018-05-01       2         234    5        0       5            NA
 7: 2018-05-01 06:00:00 2018-05-01       2         234    6        0      54            NA
 8: 2018-05-01 07:00:00 2018-05-01       2         234    7        0      91            NA
 9: 2018-05-01 08:00:00 2018-05-01       2         234    8        0      56            NA
10: 2018-05-01 09:00:00 2018-05-01       2         234    9        0      47            NA
11: 2018-05-01 10:00:00 2018-05-01       2         234   10        0      97            NA
12: 2018-05-01 11:00:00 2018-05-01       2         234   11        0      46            NA
13: 2018-05-01 12:00:00 2018-05-01       2         234   12        0      69            NA
14: 2018-05-01 13:00:00 2018-05-01       2         234   13        0      58            NA
15: 2018-05-01 14:00:00 2018-05-01       2         234   14        0      11            NA
16: 2018-05-01 15:00:00 2018-05-01       2         234   15        0      91            NA
17: 2018-05-01 16:00:00 2018-05-01       2         234   16        0      25            NA
18: 2018-05-01 17:00:00 2018-05-01       2         234   17        0       5            NA
19: 2018-05-01 18:00:00 2018-05-01       2         234   18        0      34            NA
20: 2018-05-01 19:00:00 2018-05-01       2         234   19        0      97            NA
21: 2018-05-01 20:00:00 2018-05-01       2         234   20        0      90            NA
22: 2018-05-01 21:00:00 2018-05-01       2         234   21        0      70            NA
23: 2018-05-01 22:00:00 2018-05-01       2         234   22        0      65            NA
24: 2018-05-01 23:00:00 2018-05-01       2         234   23        0     101            NA
25: 2018-05-02 00:00:00 2018-05-02       3         234    0        0      67            30
26: 2018-05-02 01:00:00 2018-05-02       3         234    1        0      72            80
27: 2018-05-02 02:00:00 2018-05-02       3         234    2        0      55            42
28: 2018-05-02 03:00:00 2018-05-02       3         234    3        0      61            90
29: 2018-05-02 04:00:00 2018-05-02       3         234    4        0      30            95
30: 2018-05-02 05:00:00 2018-05-02       3         234    5        0      15             5
31: 2018-05-02 06:00:00 2018-05-02       3         234    6        0      98            54
32: 2018-05-02 07:00:00 2018-05-02       3         234    7        0      92            91
33: 2018-05-02 08:00:00 2018-05-02       3         234    8        0      70            56
34: 2018-05-02 09:00:00 2018-05-02       3         234    9        0      81            47
35: 2018-05-02 10:00:00 2018-05-02       3         234   10        0       3            97
36: 2018-05-02 11:00:00 2018-05-02       3         234   11        0      49            46
37: 2018-05-02 12:00:00 2018-05-02       3         234   12        0      77            69
38: 2018-05-02 13:00:00 2018-05-02       3         234   13        0      22            58
39: 2018-05-02 14:00:00 2018-05-02       3         234   14        0      33            11
40: 2018-05-02 15:00:00 2018-05-02       3         234   15        0      24            91
41: 2018-05-02 16:00:00 2018-05-02       3         234   16        0      15            25
42: 2018-05-02 17:00:00 2018-05-02       3         234   17        0      42             5
43: 2018-05-02 18:00:00 2018-05-02       3         234   18        0      42            34
44: 2018-05-02 19:00:00 2018-05-02       3         234   19        0      38            97
45: 2018-05-02 20:00:00 2018-05-02       3         234   20        0      16            90
46: 2018-05-02 21:00:00 2018-05-02       3         234   21        0      15            70
47: 2018-05-02 22:00:00 2018-05-02       3         234   22        0      24            65
48: 2018-05-02 23:00:00 2018-05-02       3         234   23        0      48           101
           Date.POSIXct       Date Weekday DayCategory hour Holidays myvalue previousValue

Verification

The result is in line with OP's expectations

identical(myDT[, previousValue[seq_along(solution)]], solution)
[1] TRUE
Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134
  • Thanks a lot for the detailed explanation. I am not used to use data.table and it took me some time to figure out about the `.()` syntax and the `on =` feature but after some research I think I have understood the process. – alvaropr May 30 '18 at 08:00
  • I have checked the solution in a large dataframe with similar characteristics and seems to work perfectly. Thanks. – alvaropr May 30 '18 at 09:15
  • I have realized this answer does not consider that there are days with 23h and days with 25hours due to daylight savings time shifting in Europe. Will try to adapt it. – alvaropr Jun 04 '18 at 13:01
  • The question & sample data did not mention day light saving time shifts. Time zones and day light savings impose issues which are not restricted to Europe. Perhaps, the easiest way is to convert local times to UTC when reading the data. Then, there should be no need to adapt the code. – Uwe Jun 04 '18 at 15:03
  • Thanks! I am getting used to daylight savings programming related issues, will try to go deeper in this matter. I have found this post that might be useful: https://stackoverflow.com/questions/2532729/daylight-saving-time-and-time-zone-best-practices . – alvaropr Jun 04 '18 at 15:11