2

I'd like to summarize a set of observations in a datatable and could use some help with the syntax.

I think this is as simple as a join but I'm trying to identify that specific values were seen on a specific observation DAY even if its across multiple measurements or sensors on that day.

  • observations are summarized by date
  • observations date have varied counts of measurements (rows per date)
  • 'M'easurement columns indicate that a specific value was observed in ANY sensor for the day.

I've created 2 sample sets of data that I hope will clarify the goal. I've also created an image of an excel spreadsheet that hopes to show the relationship between the data.

library(data.table)
raw <- data.table(
  Date = as.Date(c("2013-5-4","2013-5-4","2013-5-4", "2013-5-9","2013-5-9", "2013-5-16","2013-5-16","2013-5-16", "2013-5-30")),
  S1 = c(4, 2, 3, 1, 1, 8, 7, 3, 3),
  S2 = c(2, 5, 2, 4, 4, 9, 1, 6, 4),
  S3 = c(6, 2, 2, 7, 3, 2, 7, 2, 1)
)

summarized <- data.table(
  Date = as.Date(c("2013-5-4", "2013-5-9", "2013-5-16", "2013-5-30")),
  M1 = c(FALSE,TRUE,TRUE,TRUE),
  M2 = c(TRUE,FALSE,TRUE,FALSE),
  M3 = c(TRUE,TRUE,TRUE,TRUE),
  M4 = c(TRUE,FALSE,FALSE,TRUE),
  M5 = c(TRUE,FALSE,FALSE,FALSE),
  M6 = c(TRUE,FALSE,TRUE,FALSE),
  M7 = c(FALSE,TRUE,TRUE,FALSE),
  M8 = c(FALSE,FALSE,TRUE,FALSE),
  M9 = c(FALSE,FALSE,TRUE,FALSE),
  M10 = c(FALSE,FALSE,TRUE,FALSE)
)

Excel image showing the relationship between the sets of data Excel

Raw is the measurements input. Multiple measurements can happen on the same observation date (i.e. multiple rows).

Summarized is what I'm hoping to get out. Rows are summarized and the 'm'easurement columns merely indicate that the value (following the M, i.e. M1, M2) was observed on the day in any of the V columns. For example, the number 2 was seen on the first and last observation on 5/16, but the number 5 was not seen in any of the 9 values on 5/16.

I think I need to use a join but how to calculate the M columns escapes me.

Any help is much appreciated.

Question: is there a name for this type of operation in data science or mathematics?

Update: I'm trying the following

setkey(raw,Date)
s <- data.table( Date=unique(raw$Date)) # get a datatable of the unique dates
setkey(s,Date)
s[raw, M1:=(length(na.omit(match(c(raw$V1,raw$v2,raw$v3),1)))>=1)]

Note that the values are not what's expected for 5-4 (should be FALSE). I think this is becuase the raw rows are not being constrained in my match statement.

         Date   M1
1: 2013-05-04 TRUE
2: 2013-05-09 TRUE
3: 2013-05-16 TRUE
4: 2013-05-30 TRUE

My guess is I need to use something different to subset the raw rows in the join.

Blue Magister
  • 13,044
  • 5
  • 38
  • 56
eAndy
  • 323
  • 2
  • 9
  • I think I'd call it "making dummies"...you know, dummy variables http://en.wikipedia.org/wiki/Dummy_variable_(statistics) although that's a pretty vague description. – Frank Sep 18 '13 at 14:27

2 Answers2

3

This seems to work:

raw[,lapply(1:10,`%in%`,unique(unlist(.SD))),by=Date]

The result is

         Date    V1    V2   V3    V4    V5    V6    V7    V8    V9   V10
1: 2013-05-04 FALSE  TRUE TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
2: 2013-05-09  TRUE FALSE TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
3: 2013-05-16  TRUE  TRUE TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
4: 2013-05-30  TRUE FALSE TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

If you want the columns with "M" instead of "V", use c(M=1:10) in place of 1:10.

Frank
  • 66,179
  • 8
  • 96
  • 180
1

This is a reshaping problem.

First, since it doesn't matter which sensor the data came from, let's aggregate your three columns into one column.

temp <- raw[,Reduce(union,list(S1,S2,S3)),by=Date]

Now we want to reshape from "long" to "wide" format.

A data table solution borrowed from this answer:

setkey(temp,Date,V1)
temp[CJ(unique(Date),unique(V1)), list(.N)][,
  setNames(as.list(as.logical(N)), paste0("M",unique(V1))), by = Date]
         # Date    M1    M2   M3    M4    M5    M6    M7    M8    M9
# 1: 2013-05-04 FALSE  TRUE TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
# 2: 2013-05-09  TRUE FALSE TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
# 3: 2013-05-16  TRUE  TRUE TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
# 4: 2013-05-30  TRUE FALSE TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

Base reshape works this way:

as.data.table(reshape(temp, timevar = "V1", v.names = "V1", idvar = "Date", direction = "wide"))
         # Date V1.4 V1.2 V1.3 V1.5 V1.6 V1.1 V1.7 V1.8 V1.9
# 1: 2013-05-04    4    2    3    5    6   NA   NA   NA   NA
# 2: 2013-05-09    4   NA    3   NA   NA    1    7   NA   NA
# 3: 2013-05-16   NA    2    3   NA    6    1    7    8    9
# 4: 2013-05-30    4   NA    3   NA   NA    1   NA   NA   NA

## to order by column
temp2 <- as.data.table(reshape(temp[order(V1)], timevar = "V1", v.names = "V1", idvar = "Date", direction = "wide"))
         # Date V1.1 V1.2 V1.3 V1.4 V1.5 V1.6 V1.7 V1.8 V1.9
# 1: 2013-05-09    1   NA    3    4   NA   NA    7   NA   NA
# 2: 2013-05-16    1    2    3   NA   NA    6    7    8    9
# 3: 2013-05-30    1   NA    3    4   NA   NA   NA   NA   NA
# 4: 2013-05-04   NA    2    3    4    5    6   NA   NA   NA

##converts to logical true/false
temp2[,lapply(.SD,function(x) {x[is.na(x)] <- 0; as.logical(x)}), by = Date]
       # Date   vv  V1.1  V1.2 V1.3  V1.4  V1.5  V1.6  V1.7  V1.8  V1.9
# 1: 2013-05-09 TRUE  TRUE FALSE TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
# 2: 2013-05-16 TRUE  TRUE  TRUE TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
# 3: 2013-05-30 TRUE  TRUE FALSE TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
# 4: 2013-05-04 TRUE FALSE  TRUE TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE

The package reshape2 is a bit more intuitive:

require(reshape2)
## dummy variable for TRUE/FALSE
temp[,vv := TRUE]
temp_reshape2 <- as.data.table(dcast(temp, Date ~ V1, value.var = "vv"))
## replace NA with FALSE
temp_reshape2[, lapply(.SD, function(x) {x[is.na(x)] <- FALSE; x}), by = Date]
         # Date     1     2    3     4     5     6     7     8     9
# 1: 2013-05-04 FALSE  TRUE TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
# 2: 2013-05-09  TRUE FALSE TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
# 3: 2013-05-16  TRUE  TRUE TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
# 4: 2013-05-30  TRUE FALSE TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

And for completion, a lame parse-eval solution:

limits <- temp[,c(min(V1),max(V1))]
sapply(temp[,min(V1) : max(V1)], function(x) {
temp[,eval(parse(text=paste0("M",x," := any(abs(V1 - ",x,") < .Machine$double.eps)"))),by = Date]
})
Community
  • 1
  • 1
Blue Magister
  • 13,044
  • 5
  • 38
  • 56