3

I have a wide zoo with two variables per individuals (maybe this is a double-wide?), that I would like to convert to a long panel with two columns (one for each variables) so that I can run panel regressions (say with the plm package).

This is trivial with melt and cast from the reshape package, but with the size of these data I run into memory and speed issues. So I'd like to use all base R, but I am having a hard time with the base reshape function. From the (double-)wide zoo object I can get the "melted" long data frame with stack, but I'm afraid that this will fail silently and would prefer the precision of reshape, but I can't get the syntax correct.

Is there a reshape equivalent of stack? The only syntax that I can get to run still leaves me with a (single-)wide data frame with individuals in columns and variables rows?

Below is the relevant code block in case I'm totally off base, but the highlighted block is where I'm stuck and would like to replace stack with reshape.

# similar data
library(zoo)
dates <- seq(as.Date("2004-01-01"), by=7, len=52*10)
tickers <- letters
my.df <- data.frame(dates=as.Date(rep_len(dates, length(dates)*length(tickers))),
                    tickers=rep(tickers, each=length(dates)),
                    SVI=runif(length(dates)*length(tickers), min=0, max=100)) 
svi <- read.zoo(my.df, index.column="dates", split="tickers")
meansvi <- rollmean(svi, 52, align="right")
my.zoo <- merge(svi, meansvi)

############################################################
# I need help here -- what is the base::reshape equivalent?
############################################################

# reshape to long data frame
my.df2 <- as.data.frame(my.zoo)
my.df2$dates <- index(my.zoo)
my.df3 <- stack(my.df2, select=-dates)
my.df3$dates <- as.Date(rep_len(index(my.zoo), nrow(my.df3)))

# this is close, but still "single-wide"
my.bad <- reshape(my.df2, 
                  varying=setdiff(names(my.df2), "dates"), 
                  direction="long")

############################################################
# end help region
############################################################

# split "ind" to recover ticker and variable
my.df3$ind <- sub("\\.", " ", my.df3$ind)
inds.list <- strsplit(my.df3$ind, " ")
inds.df <- do.call(rbind.data.frame, inds.list)
names(inds.df) <- c("tickers", "variables")
my.df3 <- cbind(my.df3, inds.df)
my.df3$ind <- NULL

# reshape long data frame with variables in columns
my.df4 <- reshape(my.df3, timevar="variables", idvar=c("dates", "tickers"), direction="wide")
names(my.df4) <- sub("values.", "", names(my.df4))

Edit -- more details.

The output of my stack call has the correct form, but I would like to produce it with base reshape. Here's the head of my.df3, which is the "melted" long version of my data (i.e., date-individual-variables pairs in rows).

> head(my.df3)
    values      dates tickers variables
1 16.32171 2004-01-01       a       svi
2 30.77852 2004-01-08       a       svi
3 14.75164 2004-01-15       a       svi
4 49.77205 2004-01-22       a       svi
5 59.05023 2004-01-29       a       svi
6 55.86036 2004-02-05       a       svi

The reshape call that I show above won't produce the same output as stack and leaves my data as a wide data frame, but with a row for each date-variable combination. This is my.bad in the code above.

> my.bad[1:5, 1:5]
           dates time        a         b        c
1.svi 2004-01-01  svi 16.32171 91.314077 93.19167
2.svi 2004-01-08  svi 30.77852 17.907831 61.73727
3.svi 2004-01-15  svi 14.75164 13.730005 53.04939
4.svi 2004-01-22  svi 49.77205  5.987645 79.19428
5.svi 2004-01-29  svi 59.05023 38.226002 88.26567

I can achieve the overall desired output with the code above, it's just the intermediate step that I would like to replace with reshape instead of stack. Here is the ultimate desired output, which is long with one column per variables with date-individual pairs in the rows.

> head(my.df4)
       dates tickers      svi meansvi
1 2004-01-01       a 16.32171      NA
2 2004-01-08       a 30.77852      NA
3 2004-01-15       a 14.75164      NA
4 2004-01-22       a 49.77205      NA
5 2004-01-29       a 59.05023      NA
6 2004-02-05       a 55.86036      NA

Maybe there's a way to completely skip my concern about stack failing silently?

Richard Herron
  • 9,760
  • 12
  • 69
  • 116

2 Answers2

3

If you are looking for performance , I think you can can't gain a lot from using base reshape. But you can try to use melt.data.table in development version(1.8.11).

## create dt object from zoo
dt <- data.table(index=index(my.zoo),coredata(my.zoo),key='index')
## fast melt and split variable column
melt(dt,id='index')[,c("tickers","variables","variable") := { 
  vv <- unlist(strsplit(as.character(variable),'.',fixed=TRUE))
  list(unlist(vv)[c(T,F)],unlist(vv)[c(F,T)],NULL) 
}]

         index    value tickers variables
    1: 2004-01-01 65.05146       a       svi
    2: 2004-01-08 52.25760       a       svi
    3: 2004-01-15 24.71477       a       svi
    4: 2004-01-22 94.28517       a       svi
    5: 2004-01-29 11.46672       a       svi
   ---                                      
27036: 2013-11-14 54.44904       z   meansvi
27037: 2013-11-21 54.20244       z   meansvi
27038: 2013-11-28 53.82270       z   meansvi
27039: 2013-12-05 52.66165       z   meansvi
27040: 2013-12-12 50.87973       z   meansvi
agstudy
  • 119,832
  • 17
  • 199
  • 261
  • 1
    If I understand the question correctly, you'll now need to add `dcast.data.table(out, index + variables ~ tickers, value.var = "value")` to get to the output the OP is looking for (where "out" is the result of your "melt" step). – A5C1D2H2I1M1N2O1R2T1 Dec 16 '13 at 07:20
  • Thanks for the pointer to `data.table`. I should make the jump. And it makes sense that `reshape` shouldn't be faster than `stack`. I am just looking for something more universal than `stack`. – Richard Herron Dec 16 '13 at 10:15
  • @AnandaMahto exactly. my answer was is incomplete ( I am lazy)since OP is used to use `melt+dcast`, I just show him that the same methodology exist with `data.table` and also to show him how to transform a zoo object to a data.frame. – agstudy Dec 16 '13 at 12:01
3

The first step suggested by @agstudy (bringing the "index" as a column) would be required (as far as I can tell). I would also suggest some renaming of your columns (for convenience mostly).

library(splitstackshape) # also loads the "data.table" package
dt <- data.table(index=index(my.zoo),coredata(my.zoo),key='index')
setnames(dt, gsub("([a-z].*)\\.([a-z].*)", "\\2_\\1", names(dt)))
names(dt)
#  [1] "index"     "svi_a"     "svi_b"     "svi_c"     "svi_d"     "svi_e"     "svi_f"     "svi_g"    
#  [9] "svi_h"     "svi_i"     "svi_j"     "svi_k"     "svi_l"     "svi_m"     "svi_n"     "svi_o"    
# [17] "svi_p"     "svi_q"     "svi_r"     "svi_s"     "svi_t"     "svi_u"     "svi_v"     "svi_w"    
# [25] "svi_x"     "svi_y"     "svi_z"     "meansvi_a" "meansvi_b" "meansvi_c" "meansvi_d" "meansvi_e"
# [33] "meansvi_f" "meansvi_g" "meansvi_h" "meansvi_i" "meansvi_j" "meansvi_k" "meansvi_l" "meansvi_m"
# [41] "meansvi_n" "meansvi_o" "meansvi_p" "meansvi_q" "meansvi_r" "meansvi_s" "meansvi_t" "meansvi_u"
# [49] "meansvi_v" "meansvi_w" "meansvi_x" "meansvi_y" "meansvi_z"

From there, it's a straightforward reshape problem:

reshape(dt, idvar="index", varying=2:ncol(dt), sep="_", direction = "long")
#             index time       svi  meansvi
#     1: 2004-01-01    a 42.037201       NA
#     2: 2004-01-08    a 26.178651       NA
#     3: 2004-01-15    a 22.918586       NA
#     4: 2004-01-22    a 32.408017       NA
#     5: 2004-01-29    a  8.014803       NA
#    ---                                   
# 13516: 2013-11-14    z 18.720031 51.98170
# 13517: 2013-11-21    z 40.362254 52.43208
# 13518: 2013-11-28    z 45.242975 53.18101
# 13519: 2013-12-05    z  9.951862 53.11564
# 13520: 2013-12-12    z 15.520532 52.41540

Or, sticking with "splitstackshape", you can use merged.stack (which will also be faster than reshape):

merged.stack(dt, id.vars="index", var.stubs=c("^svi", "^meansvi"), sep = "_")
#             index .time_1      ^svi ^meansvi
#     1: 2004-01-01       a 42.037201       NA
#     2: 2004-01-01       b  8.121128       NA
#     3: 2004-01-01       c 65.798174       NA
#     4: 2004-01-01       d 31.805462       NA
#     5: 2004-01-01       e  8.002793       NA
#    ---                                      
# 13516: 2013-12-12       v 68.643179 44.87272
# 13517: 2013-12-12       w 48.169667 43.63862
# 13518: 2013-12-12       x 58.160491 53.72554
# 13519: 2013-12-12       y 72.259193 47.61617
# 13520: 2013-12-12       z 15.520532 52.41540

Strictly base R reshape:

You don't have to rename your data to use reshape, but there's a lot more manual work involved if you don't (which I find more likely to introduce errors). Try:

df <- data.frame(index = index(my.zoo), coredata(my.zoo))
reshape(df, direction = "long", idvar="index", varying=2:ncol(df), 
        v.names=c("svi", "meansvi"), times = letters)

I've posted some benchmarks here. The summary is that @agstudy leads the way with the speed:

## Unit: milliseconds
##       expr   min     lq median     uq   max neval
##     AM_1() 229.2 236.47 244.66 255.91 334.1    20
##     AM_2()  84.0  86.69  90.08  93.60 176.0    20
##  agstudy()  52.5  52.95  56.09  60.36 133.5    20
##     orig() 268.7 324.58 350.45 440.32 588.6    20

I'm probably biased, but I'm pretty happy with the results from my merged.stack function.

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • Is there any reference (other than the help page) for base `reshape`? I typically use `melt` and `cast` from Hadley's `reshape` because I find the base `reshape` help page cryptic, but in this usage I run into memory and speed issues. – Richard Herron Dec 16 '13 at 09:53
  • 2
    @RichardHerron, in which usage? The usage described above in this answer? I agree that the `reshape` help is cryptic. It's one of those functions that works best when your input data is in the shape it expects to begin with, which is a bit ironic.... That said, I do answer a lot of `reshape` related questions, so let me know where you're getting hung up and I'll be happy to try to help. – A5C1D2H2I1M1N2O1R2T1 Dec 16 '13 at 10:02