0

I am working with smart meter data which is in half-hourly resolution. Due to the sheer volume of data I am trying to reduce from half-hourly resolution to hourly resolution. In doing so I am attempting to sum the consumption between two half-hourly measurements. The issue is I also have catagorical data in my data frame which I lose when using xts. This is what my data looks like:

> head(test1)
      LCLid stdorToU            DateTime KWH.hh..per.half.hour.   Acorn Acorn_grouped
1 MAC000002      Std 2012-10-12 00:30:00                      0 ACORN-A      Affluent
2 MAC000002      Std 2012-10-12 01:00:00                      0 ACORN-A      Affluent
3 MAC000002      Std 2012-10-12 01:30:00                      0 ACORN-A      Affluent
4 MAC000002      Std 2012-10-12 02:00:00                      0 ACORN-A      Affluent
5 MAC000002      Std 2012-10-12 02:30:00                      0 ACORN-A      Affluent
6 MAC000002      Std 2012-10-12 03:00:00                      0 ACORN-A      Affluent

Here is the code I have been attempting to use and the result I get.

test1 <- read.csv("test.csv", stringsAsFactors = F)
test1$DateTime <- ymd_hms(test1$DateTime)
test1$KWH.hh..per.half.hour. <- as.numeric(test1$KWH.hh..per.half.hour.)
test2 <- xts(test1$KWH.hh..per.half.hour., test1$DateTime)
head(test2)
period.apply(test2, endpoints(test2, "hours"), sum)

> period.apply(test2, endpoints(test2, "hours"), sum)
                     [,1]
2012-10-12 00:30:00 0.000
2012-10-12 01:30:00 0.000
2012-10-12 02:30:00 0.000
2012-10-12 03:30:00 0.000
2012-10-12 04:30:00 0.000
2012-10-12 05:30:00 0.000
2012-10-12 06:30:00 0.000
2012-10-12 07:30:00 0.000
2012-10-12 08:30:00 0.000
2012-10-12 09:30:00 0.000
2012-10-12 10:30:00 0.000

Ideally, I need a data set exactly as my original (test1), just half the size aggregated to hourly frequency rather than half-hourly. Can someone please help.

Thanks

ojp
  • 973
  • 1
  • 11
  • 26

2 Answers2

3

You need to create a grouping column, and then sum by group.

# create grouped column
test1$grouped_time = lubridate::floor_date(test1$DateTime, unit = "hour")
# (use ceiling_date instead if you want to round the half hours up instead of down)

# sum by group
library(dplyr)
test2 = test1 %>%
  group_by(grouped_time, LCLid, stdorToU, Acorn, Acorn_grouped) %>%
  summarize(KWH.hh.per.hour = sum(KWH.hh..per.half.hour.))

There are many alternatives to dplyr at the Sum by Group R-FAQ, in case you want to look at more options.

Note that this will sum the KWH column for each unique combination of the other columns in group_by(). If some of those can change, like if stdorToU or the ACORN values might change from an hour to the next half hour but you still want the rows combined, you need to move that column out of group_by and into summarize, and specify which value to keep, e.g.

# if ACORN can change and you want to keep the first one
test2 = test1 %>%
  group_by(grouped_time, LCLid, stdorToU, Acorn_grouped) %>%
  summarize(KWH.hh.per.hour = sum(KWH.hh..per.half.hour.),
            ACORN = first(ACORN))
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • Gregor, worked fine, although interestingly for every hour that is 00:00:00 this time doesnt exist in the data frame, just the date. Almost like R sees that time as zero. This wasnt the case before running "grouped_by". Any thought's why? – ojp Dec 18 '19 at 19:43
  • Huh, I can't replicate that. If you share a little bit of your data using `dput`, like `dput(droplevels(test1[1:10, ]))` I can hopefully see what you're seeing. (Pick a different subset than the first 10 rows to make sure a `00:00:00` example is included) – Gregor Thomas Dec 18 '19 at 19:55
  • I have commented in an answer below (sorry I wasn't sure where to add such a long response). The variable name has changed since after getting it to work I used the whole data set rather than the smaller test set – ojp Dec 18 '19 at 20:10
  • It's good practice to include some `dput` in your question, that's the best place for it. – Gregor Thomas Dec 18 '19 at 20:13
  • But, I think the answer is *it's just printing behavior*. With `x = c(1, 2, 3.5)`, if you print a subset of `x` that has only integers, they will print without a decimal: `x[1:2]` gives `1 2`. But if you print a subset that has a decimal, they all print with decimals, `x[2:3]` gives `2.0 3.5`. Same thing happens with the datetimes - if you print a subset that is all `00:00:00` in the time component, it will be omitted. But if any of the datetimes has a non-zero time component, they will all be printed. – Gregor Thomas Dec 18 '19 at 20:18
  • 1
    Thanks Gregor. I am new to datascience so great to have people like you clear things like this up for me. – ojp Dec 18 '19 at 20:21
0
> head(sm_2013_tof)
# A tibble: 6 x 6
# Groups:   grouped_time, LCLid, stdorToU, Acorn [6]
  grouped_time        LCLid     stdorToU Acorn   Acorn_grouped KWH.hh.per.hour
  <dttm>              <chr>     <chr>    <chr>   <chr>                   <dbl>
1 2013-01-01 00:00:00 MAC000146 ToU      ACORN-L Adversity               0.155
2 2013-01-01 00:00:00 MAC000147 ToU      ACORN-F Comfortable             0.276
3 2013-01-01 00:00:00 MAC000158 ToU      ACORN-H Comfortable             0.152
4 2013-01-01 00:00:00 MAC000165 ToU      ACORN-E Affluent                0.401
5 2013-01-01 00:00:00 MAC000170 ToU      ACORN-F Comfortable             0.64 
6 2013-01-01 00:00:00 MAC000173 ToU      ACORN-E Affluent                0.072
> 

here is the now hourly data after grouping.

If i make this as.data.frame you see the 00:00:00 disappears

sm_short_2013 <- as.data.frame(sm_2013_tof)

> head(sm_short_2013)
  grouped_time     LCLid stdorToU   Acorn Acorn_grouped KWH.hh.per.hour
1   2013-01-01 MAC000146      ToU ACORN-L     Adversity           0.155
2   2013-01-01 MAC000147      ToU ACORN-F   Comfortable           0.276
3   2013-01-01 MAC000158      ToU ACORN-H   Comfortable           0.152
4   2013-01-01 MAC000165      ToU ACORN-E      Affluent           0.401
5   2013-01-01 MAC000170      ToU ACORN-F   Comfortable           0.640
6   2013-01-01 MAC000173      ToU ACORN-E      Affluent           0.072
> dput(droplevels(sm_short_2013[1:10, ]))
structure(list(grouped_time = structure(c(1356998400, 1356998400, 
1356998400, 1356998400, 1356998400, 1356998400, 1356998400, 1356998400, 
1356998400, 1356998400), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    LCLid = c("MAC000146", "MAC000147", "MAC000158", "MAC000165", 
    "MAC000170", "MAC000173", "MAC000186", "MAC000187", "MAC000193", 
    "MAC000194"), stdorToU = c("ToU", "ToU", "ToU", "ToU", "ToU", 
    "ToU", "ToU", "ToU", "ToU", "ToU"), Acorn = c("ACORN-L", 
    "ACORN-F", "ACORN-H", "ACORN-E", "ACORN-F", "ACORN-E", "ACORN-E", 
    "ACORN-L", "ACORN-D", "ACORN-D"), Acorn_grouped = c("Adversity", 
    "Comfortable", "Comfortable", "Affluent", "Comfortable", 
    "Affluent", "Affluent", "Adversity", "Affluent", "Affluent"
    ), KWH.hh.per.hour = c(0.155, 0.276, 0.152, 0.401, 0.64, 
    0.072, 0.407, 0.554, 0.725, 0.158)), row.names = c(NA, 10L
), class = "data.frame")

ojp
  • 973
  • 1
  • 11
  • 26