I'd like to summarize a set of observations in a data.table and could use some help with the syntax.
I think this is as simple as a join, but I'm trying to identify which specific values were seen on a given observation DAY, even if they are spread across multiple measurements or sensors on that day.
- observations are summarized by date
- each observation date has a varying number of measurements (rows per date)
- 'M'easurement columns indicate whether a specific value was observed in ANY sensor on that day.
I've created 2 sample sets of data that I hope will clarify the goal. I've also created an image of an Excel spreadsheet that I hope shows the relationship between the data.
library(data.table)
raw <- data.table(
Date = as.Date(c("2013-5-4","2013-5-4","2013-5-4", "2013-5-9","2013-5-9", "2013-5-16","2013-5-16","2013-5-16", "2013-5-30")),
S1 = c(4, 2, 3, 1, 1, 8, 7, 3, 3),
S2 = c(2, 5, 2, 4, 4, 9, 1, 6, 4),
S3 = c(6, 2, 2, 7, 3, 2, 7, 2, 1)
)
summarized <- data.table(
Date = as.Date(c("2013-5-4", "2013-5-9", "2013-5-16", "2013-5-30")),
M1 = c(FALSE,TRUE,TRUE,TRUE),
M2 = c(TRUE,FALSE,TRUE,FALSE),
M3 = c(TRUE,TRUE,TRUE,TRUE),
M4 = c(TRUE,TRUE,FALSE,TRUE),  # 4 appears in S2 on 2013-5-9
M5 = c(TRUE,FALSE,FALSE,FALSE),
M6 = c(TRUE,FALSE,TRUE,FALSE),
M7 = c(FALSE,TRUE,TRUE,FALSE),
M8 = c(FALSE,FALSE,TRUE,FALSE),
M9 = c(FALSE,FALSE,TRUE,FALSE),
M10 = c(FALSE,FALSE,FALSE,FALSE)  # 10 never appears in raw
)
[Image: Excel spreadsheet showing the relationship between the raw and summarized data]
Raw is the measurements input. Multiple measurements can happen on the same observation date (i.e. multiple rows).
Summarized is what I'm hoping to get out. Rows are collapsed to one per date, and the 'M'easurement columns merely indicate that the value (the number following the M, i.e. M1, M2) was observed on that day in any of the S columns. For example, the number 2 was seen in the first and last observations on 5/16, but the number 5 was not seen in any of the 9 values on 5/16.
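To make the rule concrete, here is a rough sketch of the check for a single day, using 5/16 from the example. The names `day` and `flags` are only for illustration, and I'm assuming the values of interest run from 1 to 10.

```r
library(data.table)

raw <- data.table(
  Date = as.Date(c("2013-5-4","2013-5-4","2013-5-4", "2013-5-9","2013-5-9",
                   "2013-5-16","2013-5-16","2013-5-16", "2013-5-30")),
  S1 = c(4, 2, 3, 1, 1, 8, 7, 3, 3),
  S2 = c(2, 5, 2, 4, 4, 9, 1, 6, 4),
  S3 = c(6, 2, 2, 7, 3, 2, 7, 2, 1)
)

# Pool every sensor reading taken on 2013-05-16 into one vector (9 values)
day <- unlist(raw[Date == as.Date("2013-05-16"), .(S1, S2, S3)])

# Membership test: was the value seen anywhere on that day?
seen_2 <- 2 %in% day   # 2 occurs in S3
seen_5 <- 5 %in% day   # 5 occurs nowhere on this date

# One flag per candidate value (1..10 is an assumed range)
flags <- setNames(1:10 %in% day, paste0("M", 1:10))
```

What I'm after is this same membership test, repeated for every date.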
I think I need to use a join but how to calculate the M columns escapes me.
Any help is much appreciated.
Question: is there a name for this type of operation in data science or mathematics?
Update: I'm trying the following
setkey(raw,Date)
s <- data.table( Date=unique(raw$Date)) # get a datatable of the unique dates
setkey(s,Date)
s[raw, M1:=(length(na.omit(match(c(raw$S1,raw$S2,raw$S3),1)))>=1)]
Note that the value for 5-4 is not what's expected (it should be FALSE). I think this is because the raw rows are not being constrained to each date in my match statement.
Date M1
1: 2013-05-04 TRUE
2: 2013-05-09 TRUE
3: 2013-05-16 TRUE
4: 2013-05-30 TRUE
My guess is I need to use something different to subset the raw rows in the join.
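One possible direction, as a sketch rather than a confirmed solution: `raw$S1` always refers to the whole column, so the test sees every date at once. Grouping with `by = Date` (or joining with `by = .EACHI`) limits the bare column names to the rows for one date. The names `m1`, `m1_join`, `wide`, and `vals` are made up for illustration, and the 1 to 10 range is an assumption.

```r
library(data.table)

raw <- data.table(
  Date = as.Date(c("2013-5-4","2013-5-4","2013-5-4", "2013-5-9","2013-5-9",
                   "2013-5-16","2013-5-16","2013-5-16", "2013-5-30")),
  S1 = c(4, 2, 3, 1, 1, 8, 7, 3, 3),
  S2 = c(2, 5, 2, 4, 4, 9, 1, 6, 4),
  S3 = c(6, 2, 2, 7, 3, 2, 7, 2, 1)
)

# Grouped version: inside j, S1/S2/S3 hold only the current date's rows
m1 <- raw[, .(M1 = any(c(S1, S2, S3) == 1)), by = Date]

# Join version: for each date in s, evaluate j on the matching raw rows
s <- data.table(Date = unique(raw$Date))
m1_join <- raw[s, on = "Date", .(M1 = any(c(S1, S2, S3) == 1)), by = .EACHI]

# All M columns at once: test each candidate value against the pooled
# readings for the date (.SD is restricted to the S columns)
vals <- 1:10  # assumed range of values to flag
wide <- raw[, as.list(vals %in% unlist(.SD)), by = Date,
            .SDcols = c("S1", "S2", "S3")]
setnames(wide, c("Date", paste0("M", vals)))
```

With this grouping, M1 for 5-4 comes out FALSE, because the three rows for that date contain no 1.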