Create new data.table columns based on existing columns

Question

My data.table consists of hourly observations of the power produced by an engine (output) and a system state descriptor tag which tells which all components of the engine are turned on.

DATA

structure(list(time = structure(c(1517245200, 1517247000, 1517248800, 
1517250600, 1517252400, 1517254200, 1517256000, 1517257800, 1517259600, 
1517261400, 1517263200, 1517265000, 1517266800, 1517268600, 1517270400, 
1517272200, 1517274000, 1517275800, 1517277600, 1517279400, 1517281200, 
1517283000, 1517284800, 1517286600), class = c("POSIXct", "POSIXt"
), tzone = ""), output1 = c(160.03310020928, 159.706274495615, 
159.803834736236, 159.753928429527, 159.54807802046, 159.21298848298, 
158.904290018581, 158.683643772917, 158.670475839199, 158.793901799427, 
158.886487460894, 159.167829223303, 159.66751884913, 159.1288534448, 
159.141463186901, 160.116892086363, 160.517879769862, 160.615925580417, 
160.915687799509, 161.590897854561, 161.568455821241, 161.411642091721, 
161.811137570257, 162.193040254917), tag1 = c("evap only", "evap only", 
"fog & evap", "fog & evap", "evap only", "evap only", "evap only", 
"neither fog nor evap", "neither fog nor evap", "fog & evap", "evap only", "evap only", 
"evap only", "fog & evap", "evap only", "fog & evap", "evap only", 
"evap only", "evap only", "evap only", "fog & evap", "fog & evap", 
"bad data", "neither fog nor evap")), row.names = c(NA, -24L
), class = c("data.table", "data.frame"))

You can also generate some sample data using:

sample_data <- data.table(time = seq.POSIXt(from = Sys.time(), by = 60*60*3, length.out = 100), 
           output = runif(n = 100, min = 130, max = 172), 
           tag = sample(x = c('evap only', 'bad data', 'neither fog nor evap', 'fog and evap'), 
                        size = 100, replace = T))

I want to group this by day (sample data above has only two days but actual data has 3 years worth of data) and find the mean power corresponding to each tag. I would like the output to be something like:

          time  evap only fog & evap  neither fog nor evap bad data
 1: 2018-01-29  159.8391  160.0825    159.8491             161.8111

I've tried the following piece of code but the result is not in the form that I want. I'm using .SDcols because the actual dataset has a large number of other columns.

sample_data[, lapply(.SD, function(z){mean(z, na.rm = T)}), .SDcols = c('output1'), by = .(round_date(time, 'day'), tag1)]
   round_date                 tag1  output1
1: 2018-01-30            evap only 159.8391
2: 2018-01-30           fog & evap 160.0825
3: 2018-01-30 neither fog nor evap 159.8491
4: 2018-01-30             bad data 161.8111

I've seen the below questions posted on stack overflow.

Is there a data.table way of achieving this?

r u looking for `dcast(DT[, mean(output1), .(d=as.Date(time), tag1)], d ~ tag1, value.var="V1")`? since your desired output only has 1 date, its hard to tell what you are looking for — chinsoon12, Sep 30 '19 at 10:17
If you already have the mean by date isn't this just a reshaping question ? https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format — Ronak Shah, Sep 30 '19 at 10:18
@chinsoon12 data from additional dates would end up as additional rows in the output. I've added a section to generate some random data with additional dates. — Gautam, Sep 30 '19 at 12:30
@RonakShah I realized what I was missing, it works now but I'm wondering if there's a data.table way of achieving this. — Gautam, Sep 30 '19 at 12:42
Considering that you request a data.table solution, I ditto @chinsoon12. — JDG, Sep 30 '19 at 13:06

score 1 · Answer 1 · answered Sep 30 '19 at 13:10

Here is a data.table approach

#explanation of mean(.SD[[1]] ..), see akrun's comment here:
# https://stackoverflow.com/questions/29568732/using-mean-with-sd-and-sdcols-in-data-table#comment47286876_29568732
ans <- DT[, .(mean_output1 = mean(.SD[[1]], na.rm = TRUE )), 
          by = .( date = as.Date( time ), tag1 ), 
          .SDcols = c("output1") ]

dcast( ans, date~tag1, value.var = "mean_output1" )

#          date bad data evap only fog & evap neither fog nor evap
# 1: 2018-01-29       NA  159.3908   159.3701             158.6771
# 2: 2018-01-30 161.8111  160.5564   161.0323             162.1930

BetterCallMe · Answer 2 · 2019-09-30T12:40:11.317

library(dplyr)
library(lubridate)
# test is the dataframe provided in question
test1 = test %>% group_by(date = date(time), tag1) %>% 
          summarise(mean_power = mean(output1))

convert tibble produced by above code to a dataframe

test1_df = data.frame(test1)

reshape data to wide format

reshape(test1_df, idvar = "date", timevar = 
            "tag1", direction = "wide")

Output:

> output
        date evap only fog & evap bad data neither fog nor evap
1 2018-01-29  159.8697   159.8038       NA                   NA
3 2018-01-30  159.8335   160.1289 161.8111             159.8491

The row number is appearing as 3 after 1 since the date 2018-01-30 first appeared on 3rd row in test1_df.

Create new data.table columns based on existing columns

2 Answers2