
I am working in R with a long table stored as a data.table that contains the value changes recorded for variables of numeric and character type. When I want to run functions like correlations, regressions, etc., I have to convert the table into wide format and homogenise the timestamp frequency.

I found a way to convert the long table to wide, but I think it is not really efficient and I would like to know if there is a better, more data.table-native approach.

In the reproducible example below, I include the two options I found to perform the long-to-wide transformation, and in the comments I indicate the parts I believe are not optimal.

library(zoo)
library(data.table)
dt<-data.table(time=1:6,variable=factor(letters[1:6]),numeric=c(1:3,rep(NA,3)),
               character=c(rep(NA,3),letters[1:3]),key="time")
print(dt)
print(dt[,lapply(.SD,typeof)])

# Option 1

casted<-dcast(dt,time~variable,value.var=c("numeric","character"))
# Types are correct, but I get NA-filled columns;
# is there an option like drop
# available for columns instead of rows?
print(casted)
print(casted[,lapply(.SD,typeof)])


# This drop looks ugly, but I could not figure out a better way to do it
casted[,names(casted)[unlist(casted[,lapply(lapply(.SD,is.na),all)])]:=NULL]
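
# A sketch of a slightly tidier way to express the same drop ("keep" is just an
# illustrative name; as far as I can tell, dcast itself has no column-wise drop option):
keep <- names(casted)[!vapply(casted, function(x) all(is.na(x)), logical(1))]
casted <- casted[, ..keep]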

# I perform a LOCF; I do not know if I could benefit from
# data.table's roll option somehow and avoid
# the temporary in-memory copy of my dataset (this would be the second
# and minor issue)
casted<-na.locf(casted)
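
# A sketch of a data.table-native alternative to the na.locf() call above
# (casted_alt and locf_fill are illustrative names, not data.table functions):
# roll applies to joins, so this instead fills by reference with := and a small
# LOCF helper, avoiding the temporary copy of the whole table
locf_fill <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))  # index of the last non-NA value seen so far
  idx[idx == 0L] <- NA_integer_            # leading NAs stay NA
  x[idx]
}
casted_alt <- dcast(dt, time ~ variable, value.var = c("numeric", "character"))
valcols <- setdiff(names(casted_alt), "time")
casted_alt[, (valcols) := lapply(.SD, locf_fill), .SDcols = valcols]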

# Option 2

# taken from http://stackoverflow.com/questions/19253820/how-to-implement-coalesce-efficiently-in-r
coalesce2 <- function(...) {
  Reduce(function(x, y) {
    i <- which(is.na(x))
    x[i] <- y[i]
    x},
    list(...))
}
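
# For illustration: coalesce2() picks the first non-NA value position-wise, e.g.
coalesce2(c(1, NA, NA), c(NA, 2, NA), c(NA, NA, 3))  # 1 2 3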


casted2<-dcast(dt[,coalesce2(numeric,character),by=c("time","variable")],
      time~variable,value.var="V1")
# There are no NA columns, but the types are incorrect;
# it takes more space in a real table (more observations, fewer variables)
print(casted2)
print(casted2[,lapply(.SD,typeof)])

# Again, I am pretty sure there is a prettier way to do this
numericvars<-names(casted2)[!unlist(casted2[,lapply(
  lapply(lapply(.SD,as.numeric),is.na),all)])]
casted2[,eval(numericvars):=lapply(.SD,as.numeric),.SDcols=numericvars]
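
# A possible alternative sketch to the detection/conversion above, assuming base R's
# type.convert() is acceptable: it guesses each column's type directly (note it returns
# integer rather than double for whole numbers); "valcols2" is just an illustrative name
valcols2 <- setdiff(names(casted2), "time")
casted2[, (valcols2) := lapply(.SD, function(x) type.convert(as.character(x), as.is = TRUE)),
        .SDcols = valcols2]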

# same as option 1, is there a data.table native way to do it?
casted2<-na.locf(casted2)

Any advice/improvement in the process is welcome.


1 Answer


I'd maybe do the char and num tables separately and then rbind:

k        = "time"
typecols = c("numeric", "character")

res = rbindlist(fill = TRUE, 
  lapply(typecols, function(tc){
    cols = c(k, tc, "variable")
    dt[!is.na(get(tc)), ..cols][, dcast(.SD, ... ~ variable, value.var=tc)]
  })
)

setorderv(res, k)
res[, setdiff(names(res), k) := lapply(.SD, zoo::na.locf, na.rm = FALSE), .SDcols=!k]

which gives

   time a  b  c  d  e  f
1:    1 1 NA NA NA NA NA
2:    2 1  2 NA NA NA NA
3:    3 1  2  3 NA NA NA
4:    4 1  2  3  a NA NA
5:    5 1  2  3  a  b NA
6:    6 1  2  3  a  b  c

Note that the OP's final result, casted2, differs in that it has all columns as character.

You are right about casted2; there is some strange behaviour there: if I run the line with casted2[,eval(numericvars):=... twice, though, the type is properly converted. I don't know why this happens; should I open a question on that or file a bug? That aside, your solution is much more elegant than mine. I think there may be some duplicates in the real dataset when a character and a numeric happen at the same time, but it will be easy to handle that from this point. Thanks a lot – Jon Nagra Feb 17 '17 at 17:38