Selecting elements from a dataframe

Question

I have a very large dataset of electricity meter data. It looks like this (with a row for every 5 minute period):

df
                 Time Type    Energy
1 2019-08-31 23:55:00    a   -0.3558
2 2019-08-30 14:55:00    b   -3.1189
3 2019-08-29 15:15:00    c   27.3856
4 2019-08-28 19:20:00    b -155.7758
5 2019-08-27 18:30:00    a -149.3617

There's also a tariff table that looks like this:

     tf
       Hour      0      1      2      3      4      5      6
    1     0 0.1495 0.1495 0.1495 0.1495 0.1495 0.1495 0.1495
    2     1 0.1495 0.1495 0.1495 0.1495 0.1495 0.1495 0.1495
    3     2 0.1495 0.1495 0.1495 0.1495 0.1495 0.1495 0.1495
    4     3 0.1495 0.1495 0.1495 0.1495 0.1495 0.1495 0.1495
...............................

and so on up to 23 hours, with days as column headings. I need to calculate energy cost for each 5 minute interval in the dataset. It seems one could use:

require(lubridate)
xlt <- as.POSIXlt(df$Time)
cost <- tf[xlt$hour,xlt$wday]* df$Energy

but this doesn't work because all days are selected for each hour, whereas what is wanted is a single element in the tariff table indexed by hour and day. Here is the result:

cost
             4           3           2           1           0
23  -0.0531921  -0.0531921  -0.0531921  -0.0531921  -0.0531921
14  -0.7157876  -0.7157876  -0.7157876  -0.7157876  -0.7157876
15   6.2849952   6.2849952   6.2849952   6.2849952   6.2849952
19 -83.9631562 -83.9631562 -83.9631562 -83.9631562 -35.7505461
18 -80.5059563 -80.5059563 -80.5059563 -80.5059563 -34.2785102

    dput(cost)
        structure(list(`4` = c(-0.0531921, -0.71578755, 6.2849952, -83.9631562, 
        -80.5059563), `3` = c(-0.0531921, -0.71578755, 6.2849952, -83.9631562, 
        -80.5059563), `2` = c(-0.0531921, -0.71578755, 6.2849952, -83.9631562, 
        -80.5059563), `1` = c(-0.0531921, -0.71578755, 6.2849952, -83.9631562, 
        -80.5059563), `0` = c(-0.0531921, -0.71578755, 6.2849952, -35.7505461, 
        -34.27851015)), class = "data.frame", row.names = c(23L, 14L, 
        15L, 19L, 18L))

Here is the data:

    dput(head(tf))
    structure(list(Hour = 0:5, `0` = c(0.1495, 0.1495, 0.1495, 0.1495, 
    0.1495, 0.1495), `1` = c(0.1495, 0.1495, 0.1495, 0.1495, 0.1495, 
    0.1495), `2` = c(0.1495, 0.1495, 0.1495, 0.1495, 0.1495, 0.1495
    ), `3` = c(0.1495, 0.1495, 0.1495, 0.1495, 0.1495, 0.1495), `4` = c(0.1495, 
    0.1495, 0.1495, 0.1495, 0.1495, 0.1495), `5` = c(0.1495, 0.1495, 
    0.1495, 0.1495, 0.1495, 0.1495), `6` = c(0.1495, 0.1495, 0.1495, 
    0.1495, 0.1495, 0.1495)), row.names = c(NA, 6L), class = "data.frame")
 dput(df)
structure(list(Time = structure(list(sec = c(0, 0, 0, 0, 0), 
    min = c(55L, 55L, 15L, 20L, 30L), hour = c(23L, 14L, 15L, 
    19L, 18L), mday = 31:27, mon = c(7L, 7L, 7L, 7L, 7L), year = c(119L, 
    119L, 119L, 119L, 119L), wday = 6:2, yday = 242:238, isdst = c(0L, 
    0L, 0L, 0L, 0L), zone = c("AEST", "AEST", "AEST", "AEST", 
    "AEST"), gmtoff = c(NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_)), class = c("POSIXlt", "POSIXt"
)), Type = structure(c(1L, 2L, 3L, 2L, 1L), .Label = c("a", "b", 
"c"), class = "factor"), Energy = c(-0.3558, -3.1189, 27.3856, 
-155.7758, -149.3617)), row.names = c(NA, -5L), class = "data.frame")

Performance is also an issue. I apologise if this is terribly ignorant, but I have been searching for days and not found a comparable example.

Hi @Ronald, can you update this using `dput` and add expected output, see https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Bulat, Sep 20 '19 at 07:05
So you must be having one row for each hour in `tf` so 24 rows ? What do columns represent? And the final output you expect is a new column in `df` with Energy cost calculated for that duration of 5 mins based on hour of `Time` column? Can you show output for first 6 rows of `df` ? — Ronak Shah, Sep 20 '19 at 09:12
Yes, that’s right, Ronak. The columns are days of the week, 0-6. The cost column is displayed above as a separate variable, but will be added as an extra column to df. It is meant to contain the cost for the 5 min period exactly as you describe. The reason the days are needed is that the tariff varies by both day of week and hour — Ronald, Sep 20 '19 at 09:45

score 0 · Accepted Answer · answered Sep 20 '19 at 09:56

Try to create a matrix using cbind with row and column index and then use it to extract tariff from tf which is later multiplied to Energy.

df$cost <- df$Energy * tf[cbind(df$Time$hour + 1,  df$Time$wday + 1)]

Adding 1 in row index since the hours start from 0 but row number start from 1 and adding 1 in column index because we want to extract the values from 2nd column onwards (as 1st column is Hour).

Selecting elements from a dataframe

1 Answers1