1

I need to perform a simple reshape of data from long to wide, and this needs to work in base R. For this use case, reshape() seems to be extraordinarily slow (despite assertions that it is very fast https://stackoverflow.com/a/12073077/3017280). This example is a reasonable approximation of my data. I know that in this example I do not need both Index columns, but I do in the real data. On my laptop 10,000 rows takes 3 seconds, and 40,000 rows takes over 200 seconds. The real data has over one million rows, so reshape() is obviously a non-starter. Can anyone shed any light on why it takes so long in this case? I worked around the problem using split / lapply / Reduce + merge, which is clumsy but very much quicker.

n <- 5000
dfLong <- data.frame(Index1 = rep(sample(1E6:2E6, n), 4),
                  Index2 = rep(sample(3E6:4E6, n), 4),
                  Key = rep(1:4, each = n),
                  Date = sample(seq.Date(as.Date("2020-01-01"),
                                         as.Date("2021-12-31"), 
                                         by = "1 day"),
                                size = n * 4, replace = TRUE),
                  Score = sample(0:48, n * 4, replace = TRUE))
                                
system.time(dfWide <- reshape(data = dfLong,
          v.names = c("Date", "Score"),
          timevar = "Key",
          idvar = c("Index1", "Index2"),
          sep = "_Q",
          direction = "wide"))
 
Knackiedoo
  • 502
  • 3
  • 8
  • 1
    Pity, tidyr::pivot_wider works on your sample data in <1 sec with n = 1E6. I'm also curious why reshape is so slow here. – Jon Spring Jan 03 '22 at 20:48
  • 2
    Perhaps relevant look at comparative performance for reshaping, from 2019 so likely somewhat out of date: https://stackoverflow.com/questions/55077668/what-methods-can-we-use-to-reshape-very-large-data-sets – Jon Spring Jan 03 '22 at 20:51
  • 1
    I don't know that my answer addresses what you are asking, but to answer "why" would take a lot more diving and may not provide anything actionable (except perhaps a regression test for R-core). One could always profile the operation and see where most of the time is spent (an exercise not in the cards for me). Subjectively, I've always found `data.table::dcast` and `::melt` to be better than `reshape(.)`, and `tidyr::pivot_*` functions to be the most featureful and readable. – r2evans Jan 03 '22 at 20:59
  • You might also take a look here too with some discussion on using `tapply`: https://stackoverflow.com/q/5307313/15293191 – AndrewGB Jan 03 '22 at 21:04

3 Answers3

3

I don't know that I've ever made the claim that stats::reshape is the fastest.

For comparisons, stats::reshape is not as fast on my i9/64GB-ram system:

system.time(
dfWide <- reshape(data = dfLong,
          v.names = c("Date", "Score"),
          timevar = "Key",
          idvar = c("Index1", "Index2"),
          sep = "_Q",
          direction = "wide")
)
#    user  system elapsed 
#   19.63    0.03   19.73 

But other reshaping functions do much better:

system.time(
  tidyrWide <- pivot_wider(
    dfLong, c("Index1", "Index2"),
    names_prefix = "Q", names_from = "Key",
    values_from = c("Date", "Score"))
)
#    user  system elapsed 
#    0.01    0.00    0.02 

nms <- names(dfWide)
tidyrWide <- subset(tidyrWide, select = nms) # column order
dfOrder <- do.call(order, dfWide)
tidyrOrder <- do.call(order, tidyrWide)
all.equal(dfWide[dfOrder,], as.data.frame(tidyrWide)[tidyrOrder,], check.attributes = FALSE)
# [1] TRUE

Similarly, data.table::dcast is equally fast:

dtLong <- as.data.table(dfLong)
system.time(
  dtWide <- data.table::dcast(
    Index1 + Index2 ~ paste0("Q", Key),
    data = dtLong, value.var = c("Date", "Score"))
)
#    user  system elapsed 
#    0.00    0.01    0.02 

dtWide <- subset(dtWide, select = nms) # column order
dtOrder <- do.call(order, dtWide)
all.equal(dfWide[dfOrder,nms], as.data.frame(dtWide)[dtOrder,nms], check.attributes = FALSE)
# [1] TRUE
r2evans
  • 141,215
  • 6
  • 77
  • 149
3

If you look at what functions reshape calls with the profvis package, you can see that almost all of the total time spent is on this one line in the function. The interaction function is used only to combine your two id columns into a single column.

data[, tempidname] <- interaction(data[, idvar], 
                drop = TRUE)

Rather than interaction, you could use do.call(paste0, data[, idvar]). You can use a function to create an environment with interaction equal to this faster function.

new_reshape <- function(...){
  interaction <- function(x, drop) do.call(paste0, x)
  environment(reshape) <- environment()
  reshape(...)
}

Now it's much faster

system.time(dfWide <- reshape(data = dfLong,
          v.names = c("Date", "Score"),
          timevar = "Key",
          idvar = c("Index1", "Index2"),
          sep = "_Q",
          direction = "wide"))
 #   user  system elapsed 
 # 35.292   0.538  36.236 

system.time(new_dfWide <- new_reshape(data = dfLong,
          v.names = c("Date", "Score"),
          timevar = "Key",
          idvar = c("Index1", "Index2"),
          sep = "_Q",
          direction = "wide"))

  # user  system elapsed 
  # 0.071   0.009   0.081  

all.equal(new_dfWide, dfWide)
# [1] TRUE

You can be even faster than that by using plyr:::ninteraction. The only non-base dependency of this function is plyr:::id_var, which has no dependencies, meaning if you can't install packages you can just copy-paste this function definition pretty easily (adding a comment giving credit).

new_reshape <- function(...){
  # interaction = plyr:::ninteraction
  # id_var = plyr:::id_var
  interaction <- 
    function (.variables, drop = FALSE) 
    {
        lengths <- vapply(.variables, length, integer(1))
        .variables <- .variables[lengths != 0]
        if (length(.variables) == 0) {
            n <- nrow(.variables) %||% 0L
            return(structure(seq_len(n), n = n))
        }
        if (length(.variables) == 1) {
            return(id_var(.variables[[1]], drop = drop))
        }
        ids <- rev(lapply(.variables, id_var, drop = drop))
        p <- length(ids)
        ndistinct <- vapply(ids, attr, "n", FUN.VALUE = numeric(1), 
            USE.NAMES = FALSE)
        n <- prod(ndistinct)
        if (n > 2^31) {
            char_id <- do.call("paste", c(ids, sep = "\r"))
            res <- match(char_id, unique(char_id))
        }
        else {
            combs <- c(1, cumprod(ndistinct[-p]))
            mat <- do.call("cbind", ids)
            res <- c((mat - 1L) %*% combs + 1L)
        }
        attr(res, "n") <- n
        if (drop) {
            id_var(res, drop = TRUE)
        }
        else {
            structure(as.integer(res), n = attr(res, "n"))
        }
    }  
  id_var <- 
  function (x, drop = FALSE) 
  {
      if (length(x) == 0) 
          return(structure(integer(), n = 0L))
      if (!is.null(attr(x, "n")) && !drop) 
          return(x)
      if (is.factor(x) && !drop) {
          x <- addNA(x, ifany = TRUE)
          id <- as.integer(x)
          n <- length(levels(x))
      }
      else {
          levels <- sort(unique(x), na.last = TRUE)
          id <- match(x, levels)
          n <- max(id)
      }
      structure(id, n = n)
  }
  environment(reshape) <- environment()
  reshape(...)
}
system.time(new_dfWide <- new_reshape(data = dfLong,
          v.names = c("Date", "Score"),
          timevar = "Key",
          idvar = c("Index1", "Index2"),
          sep = "_Q",
          direction = "wide"))

  #  user  system elapsed 
  # 0.015   0.000   0.015 
IceCreamToucan
  • 28,083
  • 2
  • 22
  • 38
  • 1
    Good catch! `interaction` was the named culprit on this pretty old R [bugzilla post](https://bugs.r-project.org/show_bug.cgi?id=14121). Would be a nice issue to report to core team! – Parfait Jan 03 '22 at 21:39
  • 1
    Accepted since this answered the question "why" it takes so long. Thanks @Parfait for digging out the bugzilla post which discusses this further. Of course `interaction` creates a factor with ALL possible combinations of Index1 and Index2, which is completely unnecessary for reshaping. In this case it would have tried to create 1E12 factor levels! A simple alternative (also suggested in the bugzilla post) is to concatenate the id variables and use that for the idvar e.g. `paste(Index1, Index2, sep = ":")` For general purpose use, concatenate factor levels to avoid AB & C == A & BC etc. – Knackiedoo Jan 04 '22 at 11:55
  • I thought using paste would still be relatively slow, but was wrong. I've updated the answer to use that since it's much simpler, for people who come across this question later. – IceCreamToucan Jan 04 '22 at 14:05
  • 1
    @Parfait FYI I raised this on Bugzilla a couple of days ago https://bugs.r-project.org/show_bug.cgi?id=18276 – Knackiedoo Jan 06 '22 at 09:19
  • @IceCreamToucan Very neat! A simple paste0 is a little risky, since `paste0("A", "BC")==paste0("AB", "C")` Pasting factor levels with a separator removes that issue. – Knackiedoo Jan 06 '22 at 09:24
2

Consider an advanced modified version of @Moody_Mudskipper's matrix_spread, using base R. Since matrix will simplify complex types like Date, some adhoc changes will be required:

Function

matrix_spread <- function(df1, id, key, value, sep){
  unique_ids <-  unique(df1[[key]])
  mats <- lapply(df1[value], function(x) 
    matrix(x, ncol=length(unique_ids), byrow = FALSE)
  )
  df2 <- do.call(
    data.frame, list(unique(df1[id]), mats)
  )
  
  # RENAME COLS
  names(df2)[(length(id)+1):ncol(df2)] <- as.vector(
    sapply(value, function(x, y) paste0(x, sep, y), unique_ids)
  )
  # REORDER COLS
  df2 <- df2[c(id, as.vector(
    outer(c(value), unique_ids, function(x, y) paste0(x, sep, y))
  ))]
  
  return(df2)
}

Application

system.time(
  dfWide2 <- matrix_spread(
    df1 = dfLong, 
    id = c("Index1", "Index2"),
    key = "Key",
    value = c("Date", "Score"),
    sep = "_Q"
  )
)
#  user  system elapsed 
# 0.022   0.000   0.023 

# CONVERT INTEGERS TO DATES
dfWide2[grep("Date", names(dfWide2))] <- lapply(
  dfWide2[grep("Date", names(dfWide2))],
  as.Date,
  origin = "1970-01-01"
)

# REPLICATES OP'S reshape
identical(data.frame(dfWide), dfWide2)
# [1] TRUE
Parfait
  • 104,375
  • 17
  • 94
  • 125