
Recently I have been doing all my data manipulation with dplyr, and it is an excellent tool for that. However, I am unable to melt or cast a data frame using dplyr. Is there any way to do that? Right now I am using reshape2 for this purpose.

I want a dplyr solution for:

library(reshape2)
data(iris)
dat <- melt(iris, id.vars = "Species")
micstr
Koundy
  • The successor to `reshape2` is `tidyr`. The equivalent of `melt` and `dcast` are `gather` and `spread` respectively. It is not available on CRAN yet, but you can download it from github (https://github.com/hadley/tidyr)! – konvas Jul 22 '14 at 07:08
  • @konvas Update: `tidyr` is now on CRAN (http://cran.r-project.org/web/packages/tidyr/index.html) – dickoa Jul 22 '14 at 07:38
  • @konvas why do you not just put it as proper answer? – Beasterfield Jul 22 '14 at 07:40
  • @dickoa it is as of yesterday!! :) thanks for letting me know! – konvas Jul 22 '14 at 07:54
  • @Beasterfield I think a proper answer would involve more detail, as e.g. to how to use `gather` to achieve the output of the `melt` example in the OP and I did not have time for it. But I thought I'd let @koundy know anyhow... – konvas Jul 22 '14 at 07:56

3 Answers


The successor to reshape2 is tidyr. The equivalents of melt() and dcast() are gather() and spread(), respectively. The equivalent of your code would then be

library(tidyr)
data(iris)
dat <- gather(iris, variable, value, -Species)

If you have magrittr imported you can use the pipe operator like in dplyr, i.e. write

dat <- iris %>% gather(variable, value, -Species)

Note that you need to specify the variable and value names explicitly, unlike in melt(). I find the syntax of gather() quite convenient, because you can just specify the columns you want to be converted to long format, or specify the ones you want to remain in the new data frame by prefixing them with '-' (just like for Species above), which is a bit faster to type than in melt(). However, I've noticed that on my machine at least, tidyr can be noticeably slower than reshape2.
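The two ways of selecting columns described above can be sketched as follows. This is only a minimal illustration, assuming tidyr is installed; the names `long_a` and `long_b` are made up for this example.

```r
library(tidyr)

# Option 1: name the columns that should be converted to long format
long_a <- gather(iris, variable, value, Sepal.Length:Petal.Width)

# Option 2: exclude the columns that should stay as they are
long_b <- gather(iris, variable, value, -Species)

# Both calls gather the same four measurement columns
identical(long_a, long_b)
dim(long_a)
```

Either form gathers the four measurement columns into 600 rows of `variable`/`value` pairs, with `Species` kept as an identifier column.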

Edit: In reply to @hadley's comment below, I'm posting some timing info comparing the two functions on my PC.

library(reshape2)       # melt()
library(tidyr)          # gather()
library(microbenchmark)
microbenchmark(
    melt = melt(iris, id.vars = "Species"),
    gather = gather(iris, variable, value, -Species)
)
# Unit: microseconds
#    expr     min       lq  median       uq      max neval
#    melt 278.829 290.7420 295.797 320.5730  389.626   100
#  gather 536.974 552.2515 567.395 683.2515 1488.229   100

set.seed(1)
iris1 <- iris[sample(seq_len(nrow(iris)), 1e6, replace = TRUE), ]
system.time(melt(iris1,id.vars="Species"))
#    user  system elapsed 
#   0.012   0.024   0.036 
system.time(gather(iris1, variable, value, -Species))
#    user  system elapsed 
#   0.364   0.024   0.387 

sessionInfo()
# R version 3.1.1 (2014-07-10)
# Platform: x86_64-pc-linux-gnu (64-bit)
# 
# locale:
#  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
#  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
#  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
#  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
#  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
# [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] reshape2_1.4         microbenchmark_1.3-0 magrittr_1.0.1      
# [4] tidyr_0.1           
# 
# loaded via a namespace (and not attached):
# [1] assertthat_0.1 dplyr_0.2      parallel_3.1.1 plyr_1.8.1     Rcpp_0.11.2   
# [6] stringr_0.6.2  tools_3.1.1   
konvas
  • It shouldn't be noticeably slower since it's basically all the same code. If you can provide a reproducible example, I'd love to see it. – hadley Jul 25 '14 at 21:53
  • @hadley I've posted some info. I realise this is probably not due to the code and may be specific to my system. The 'user' part of `system.time()` seems to be what makes the difference, although I am not exactly sure what this represents, but I'm sure you'll know :) – konvas Jul 28 '14 at 08:59
  • @hadley For me too, melt performs faster than gather, so I will stick to it for a while. – apc Aug 07 '14 at 06:19
  • That's really weird. I'll take a look. – hadley Aug 07 '14 at 17:36
  • Great answer, and nice work Hadley, but only tackles half the question! A spread example would be good too – Louis Maddox Apr 28 '15 at 19:47
  • As an update to this answer, gather() and spread() were superseded by pivot_longer() and pivot_wider() as of tidyr 1.0.0. Running similar benchmarks shows melt is still faster than either gather or pivot_longer (based on tidyr 1.1.4); pivot_longer is currently the slowest of the three options. – 2D1C Dec 31 '21 at 14:38
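For readers on tidyr 1.0.0 or later, the pivot functions mentioned in the last comment could be used roughly like this. It is a sketch rather than part of the original answer; the three-row subset and the names `long` and `wide` are chosen only for illustration.

```r
library(tidyr)  # assumes tidyr >= 1.0.0

# Small example: one row per species
mini_iris <- iris[c(1, 51, 101), ]

# Long format: counterpart of melt() / gather()
long <- pivot_longer(mini_iris, cols = -Species,
                     names_to = "variable", values_to = "value")

# Wide format again: counterpart of dcast() / spread()
wide <- pivot_wider(long, names_from = variable, values_from = value)

dim(long)
dim(wide)
```

One difference to be aware of: pivot_longer() keeps the values of each original row together, whereas melt() and gather() stack the data column by column, so the row order of the long result differs.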

In addition, casting can be done using tidyr::spread().

An example:

library(reshape2)
library(tidyr)
library(dplyr)

# example data : `mini_iris`
(mini_iris <- iris[c(1, 51, 101), ])

# melt
(melted1 <- mini_iris %>% melt(id.vars = "Species"))         # on reshape2
(melted2 <- mini_iris %>% gather(variable, value, -Species)) # on tidyr

# cast
melted1 %>% dcast(Species ~ variable, value.var = "value") # on reshape2
melted2 %>% spread(variable, value)                        # on tidyr
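
A caveat when moving from this three-row example to the full iris data: spread() needs each output row to be uniquely identified by the remaining columns, and Species alone does not do that for 150 rows. One possible workaround, sketched here with a hypothetical helper column `row_id` that is not part of the original answer, is to add a row identifier before gathering:

```r
library(dplyr)
library(tidyr)

# Add a row identifier so every measurement can be traced back to its row
iris_long <- iris %>%
  mutate(row_id = row_number()) %>%
  gather(variable, value, -Species, -row_id)

# With row_id present, spread() can rebuild one row per original observation
iris_wide <- iris_long %>%
  spread(variable, value) %>%
  select(-row_id)

dim(iris_wide)
```

The column order of `iris_wide` may differ from that of `iris`, so compare by column name rather than by position.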
Lovetoken

To add to the answers above, using @Lovetoken's mini_iris example (this is too complex for a comment), for newcomers who do not understand what is meant by melting and casting.

library(reshape2)
library(tidyr)
library(dplyr)

# example data : `mini_iris`
mini_iris <- iris[c(1, 51, 101), ]

# mini_iris
#     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
# 1            5.1         3.5          1.4         0.2     setosa
# 51           7.0         3.2          4.7         1.4 versicolor
# 101          6.3         3.3          6.0         2.5  virginica

Melting takes the data frame and expands it into a long list of values. This is not efficient, but it can be useful if you need to combine sets of data. Think of the structure of an ice cube melting on a tabletop and spreading out.

melted1 <- mini_iris %>% melt(id.vars = "Species")

nrow(melted1)
# [1] 12

head(melted1)
#      Species     variable value
# 1     setosa Sepal.Length   5.1
# 2 versicolor Sepal.Length   7.0
# 3  virginica Sepal.Length   6.3
# 4     setosa  Sepal.Width   3.5
# 5 versicolor  Sepal.Width   3.2
# 6  virginica  Sepal.Width   3.3

You can see how the data has now been broken into many rows of values. The column names are now text within a variable column.

Casting reassembles the long data back into a wide data.table or data.frame.
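
As a rough sketch of that last step, the melted example can be cast back with reshape2's dcast(). The name `casted` is made up for this illustration.

```r
library(reshape2)

# Same three-row example as above
mini_iris <- iris[c(1, 51, 101), ]
melted1 <- melt(mini_iris, id.vars = "Species")

# One row per Species, one column per entry in `variable`
casted <- dcast(melted1, Species ~ variable, value.var = "value")

names(casted)
nrow(casted)
```

The formula `Species ~ variable` reads as "rows are identified by Species, columns come from variable", and `value.var` names the column that fills the cells.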

micstr