Transpose and Filter Dataframe with null values in R

Question

This one is almost a challenge!

I have the following dataframe:

tag     hour                    val
N1      2013-01-01 00:00:00     0.3404266179
N1      2013-01-01 01:00:00     0.3274182995
N1      2013-01-01 02:00:00     0.3142598749
N2      2013-01-01 02:00:00     0.3189924887
N2      2013-01-01 04:00:00     0.3170907762
N3      2013-01-01 05:00:00     0.3161910788
N3      2013-01-01 06:00:00     0.4247638954

I need to transform it to something like this:

hour                    N1              N2              N3
2013-01-01 00:00:00     0.3404266179    NULL            NULL
2013-01-01 01:00:00     0.3274182995    NULL            NULL
2013-01-01 02:00:00     0.3142598749    0.3189924887    NULL
2013-01-01 03:00:00     NULL            NULL            NULL
2013-01-01 04:00:00     NULL            0.3170907762    NULL
2013-01-01 05:00:00     NULL            NULL            0.3161910788
2013-01-01 06:00:00     NULL            NULL            0.4247638954

As things are not that easy, my dataframe goes up to N5000 and hour has almost 200.000 entries for each N.

The timestamp is very well behaved, as it increases minute by minute for everybody in a way you could generate all timestamps with a simple command like strptime("2013-01-01 00:00:00", "%Y-%m-%d %H:%M:%S") + c(0:172800)*60 (172800 minutes ~ 4 months). But not necessarily you have data for every timestamp, as I show on the example.

I know I could write a function with endless loops, but is there a way to do this using only R (and its packages) functions?

Thanks!

If you have no "0"s in "val", you could see something like `xtabs(val ~ as.character(hour) + tag, DF)` and replace "0"s with `NA`? — alexis_laz, Jun 04 '14 at 18:29
Although I do have 0's in my dataframe, xtabs may do the trick with little effort, but is too slow! Cast is almost instantaneous! — Rafa Borges, Jun 04 '14 at 18:38

Chipmonkey · Accepted Answer · 2014-06-04T18:33:05.517

3

You want to use the "reshape2" package:

 install.packages("reshape2")
 library(reshape2)
 newdf <- dcast(mydata, hour~tag)

reshape2 is a wildly powerful package that I completely fail to understand... but sometimes it has nice useful things like this that just work. :-)

UPDATED: that's "dcast" not "cast"... I mistakenly used the "reshape" not the "reshape2" package. Fixed!

edited Jun 04 '14 at 18:33

answered Jun 04 '14 at 18:16

Chipmonkey

863
7
18

2

I may be wrong, but shouldn't it be `dcast` instead of `cast`? I tried to run your code, but R couldn't find the function `cast`, although `?cast` takes me to `reshape2` help page (reshape2 is installed). (It works nicely with `dcast`) – talat Jun 04 '14 at 18:29
You're totally right! You know, I think I had "reshape" installed at the same time. Editing now. – Chipmonkey Jun 04 '14 at 18:32
It is so simple that I'm embarrassed! I really should understand better what reshape is capable of. As you, most of the functions I fail to understand... @beginneR, dcast returns a dataframe. acast returns a array/vector – Rafa Borges Jun 04 '14 at 18:32

score 2 · Answer 2 · answered Jun 04 '14 at 18:03

This is neither the most straightforward nor elegant solution, but it works:

An exemplary data.frame:

df <- data.frame(tag=rep(c("N1", "N2", "N4"), c(3,2,2)),
                 hour=structure(c(1,2,3,3,5,6,7), class="POSIXct"),
                 val=runif(7))
##   tag                hour       val
## 1  N1 1970-01-01 01:00:01 0.6645598
## 2  N1 1970-01-01 01:00:02 0.7924186
## 3  N1 1970-01-01 01:00:03 0.3813311
## 4  N2 1970-01-01 01:00:03 0.8555780
## 5  N2 1970-01-01 01:00:05 0.4480540
## 6  N4 1970-01-01 01:00:06 0.1875233
## 7  N4 1970-01-01 01:00:07 0.5755332

Now we create the resulting date column (it's just an example):

uh <- structure(1:7, class="POSIXct") # or e.g. uh <- unique(df$hour), or seq(), etc.

Then we create an "empty" resulting data frame (each val will be NA)

nr <- length(uh) # number of rows on out
# column definitions:
(coldef <- paste("hour=uh", paste(unique(df$tag), "NA_real_", sep="=", collapse=", "), sep=", "))
## [1] "hour=uh, N1=NA_real_, N2=NA_real_, N4=NA_real_"
# create output df:
outdf <- eval(parse(text=sprintf("data.frame(list(%s))", coldef)))

Finally, let's set vals in each N* column:

for (idx in split(1:nrow(df), df$tag))
   outdf[outdf$hour %in% df$hour[idx], as.character(df$tag[idx[1]])] <- df$val[idx]

score 2 · Answer 3 · answered Jun 04 '14 at 18:35

You might also consider the base function reshape if you don't want to bother with another package. Using @gagolews's sample data

> reshape(df, idvar="hour", timevar="tag", v.names="val", direction="wide")
                 hour    val.N1    val.N2    val.N4
1 1969-12-31 19:00:01 0.8156553        NA        NA
2 1969-12-31 19:00:02 0.9203821        NA        NA
3 1969-12-31 19:00:03 0.8127614 0.7386737        NA
5 1969-12-31 19:00:05        NA 0.9648562        NA
6 1969-12-31 19:00:06        NA        NA 0.2540216
7 1969-12-31 19:00:07        NA        NA 0.5024042

For a small dataframe it works flawlessly, but for my actual df (1M rows), it crashes due to - apparently - memory management. — Rafa Borges, Jun 04 '14 at 18:53

Transpose and Filter Dataframe with null values in R

3 Answers3

Linked