Spread vs dcast

Question

I have a table like this,

> head(dt2)
  Weight Height   Fitted interval limit    value
1   65.6  174.0 71.91200     pred   lwr 53.73165
2   80.7  193.5 91.63237     pred   lwr 73.33198
3   72.6  186.5 84.55326     pred   lwr 66.31751
4   78.8  187.2 85.26117     pred   lwr 67.02004
5   74.8  181.5 79.49675     pred   lwr 61.29244
6   86.4  184.0 82.02501     pred   lwr 63.80652

I want it to have like this,

> head(reshape2::dcast(dt2, 
         Weight + Height + Fitted + interval ~ limit, 
         fun.aggregate = mean))
  Weight Height   Fitted interval      lwr      upr
1   42.0  153.4 51.07920     conf 49.15463 53.00376
2   42.0  153.4 51.07920     pred 32.82122 69.33717
3   43.2  160.0 57.75378     conf 56.35240 59.15516
4   43.2  160.0 57.75378     pred 39.54352 75.96404
5   44.8  149.5 47.13512     conf 44.87642 49.39382
6   44.8  149.5 47.13512     pred 28.83891 65.43133

But using tidyr::spread, How can I do that?

I was using,

> tidyr::spread(dt2, limit, value)

But getting the error,

Error: Duplicate identifiers for rows (1052, 1056), (238, 242), (1209, 1218), (395, 404), (839, 1170), (25, 356), (1173, 1203, 1215), (359, 389, 401), (1001, 1200), (187, 386), (906, 907), (92, 93), (930, 1144), (116, 330), (958, 1171), (144, 357), (902, 1018), (88, 204), (960, 1008), (146, 194), (1459, 1463), (645, 649), (1616, 1625), (802, 811), (1246, 1577), (432, 763), (1580, 1610, 1622), (766, 796, 808), (1408, 1607), (594, 793), (1313, 1314), (499, 500), (1337, 1551), (523, 737), (1365, 1578), (551, 764), (1309, 1425), (495, 611), (1367, 1415), (553, 601)

Random 10 Rows::

> dt[sample(nrow(dt), 10), ]
     Weight Height   Fitted interval limit    value
1253   52.2  162.5 60.28203     conf   upr 61.51087
426    49.1  158.8 56.54022     pred   upr 74.75756
1117   78.4  184.5 82.53066     conf   lwr 80.98778
1171   85.9  166.4 64.22611     conf   lwr 63.21254
948    61.4  177.8 75.75494     conf   lwr 74.66393
384    90.9  172.7 70.59731     pred   lwr 52.41828
289    75.9  172.7 70.59731     pred   lwr 52.41828
3      44.8  149.5 47.13512     pred   lwr 28.83891
774    87.3  182.9 80.91258     pred   upr 99.12445
772    86.4  175.3 73.22669     pred   upr 91.40919

you example doesn't contain `upr` in `limit`, nor `conf` in `interval`, which means your expected outcome is not reproducible — mtoto, Feb 05 '16 at 13:25
Why not keep it in long format and just aggregate? See [here for an example](http://stackoverflow.com/a/32795497/2204410) with base R, *dplyr* & *data.table*. — Jaap, Feb 05 '16 at 13:37
Although, I have done it with dcast, I want to do it using tidyr just to learn things. @mtoto That is just a head of my dataset, I will edit it to give you a random sample, for reproducibility. — TheRimalaya, Feb 05 '16 at 13:42
This should work: `dt2 %>% group_by(interval, limit) %>% summarise_each(funs(mean)) %>% spread(limit, value, -c(1:3))` — Jaap, Feb 05 '16 at 13:46
That summarized by interval and limit, and gave me just two rows. — TheRimalaya, Feb 05 '16 at 13:48
both your `tidyr` and `dcast `code give the same output with your example data — mtoto, Feb 05 '16 at 13:57
I am sorry to give you just a random sample, I is working, but I am not sure why it is giving the error of Duplicate identifiers when I am using the complete data. Anyway thank you very much. — TheRimalaya, Feb 05 '16 at 15:00

score 13 · Accepted Answer · edited Feb 09 '18 at 09:36

Let's say you were starting with data that looked like this:

mydf
#   Weight Height  Fitted interval limit    value
# 1     42  153.4 51.0792     conf   lwr 49.15463
# 2     42  153.4 51.0792     pred   lwr 32.82122
# 3     42  153.4 51.0792     conf   upr 53.00376
# 4     42  153.4 51.0792     pred   upr 69.33717
# 5     42  153.4 51.0792     conf   lwr 60.00000
# 6     42  153.4 51.0792     pred   lwr 90.00000

Notice the duplication in rows 5 and 6 of the grouping columns (1 to 5). This is essentially what "tidyr" is telling you. The first row and fifth are duplicates, as are the second and sixth.

tidyr::spread(mydf, limit, value)
# Error: Duplicate identifiers for rows (1, 5), (2, 6)

As suggested by @Jaap, the solution is to first "summarise" the data. Since "tidyr" is only for reshaping data (unlike "reshape2", which aggregated and reshaped), you need to perform the aggregation with "dplyr" before you change the data form. Here, I've done that with summarise for the "value" column.

If you stopped the execution at the summarise step, you would find that our original 6-row dataset had "shrunk" to 4 rows. Now, spread would work as expected.

mydf %>% 
  group_by(Weight, Height, Fitted, interval, limit) %>% 
  summarise(value = mean(value)) %>% 
  spread(limit, value)
# Source: local data frame [2 x 6]
# 
#   Weight Height  Fitted interval      lwr      upr
#    (dbl)  (dbl)   (dbl)    (chr)    (dbl)    (dbl)
# 1     42  153.4 51.0792     conf 54.57731 53.00376
# 2     42  153.4 51.0792     pred 61.41061 69.33717

This matches the expected output from dcast with fun.aggregate = mean.

reshape2::dcast(mydf, Weight + Height + Fitted + interval ~ limit, fun.aggregate = mean)
#   Weight Height  Fitted interval      lwr      upr
# 1     42  153.4 51.0792     conf 54.57731 53.00376
# 2     42  153.4 51.0792     pred 61.41061 69.33717

Sample data:

 mydf <- structure(list(Weight = c(42, 42, 42, 42, 42, 42), Height = c(153.4, 
     153.4, 153.4, 153.4, 153.4, 153.4), Fitted = c(51.0792, 51.0792,         
     51.0792, 51.0792, 51.0792, 51.0792), interval = c("conf", "pred",        
     "conf", "pred", "conf", "pred"), limit = structure(c(1L, 1L,             
     2L, 2L, 1L, 1L), .Label = c("lwr", "upr"), class = "factor"),            
         value = c(49.15463, 32.82122, 53.00376, 69.33717, 60,          
         90)), .Names = c("Weight", "Height", "Fitted", "interval",     
     "limit", "value"), row.names = c(NA, 6L), class = "data.frame")

Thanks! I was thinking about how to deal with the aggregating function. I think Hadely wants `tidyr` to use along with `dplyr`. — TheRimalaya, Feb 05 '16 at 23:53
This is an excellent answer and made me understand the difference between `dcast` and `spread`. Thank you! — Mikko, Feb 09 '18 at 09:27

score 1 · Answer 2 · answered Feb 05 '16 at 16:42

Here are data.table alternatives to dplyr. Use mydf from Ananda's answer.

library(data.table)
library(magrittr)
library(tidyr)

DT <- data.table(mydf)

First, you can use by to compute the mean by each limit.

DT[, .(lwr = mean(value[limit == "lwr"]), 
       upr = mean(value[limit == "upr"])), 
   by = .(Weight, Height, Fitted, interval)]

If this limit == ... looks too much hard coding, you can first aggregate into a long format, then spread. This works because once you aggregate, there is no duplicate.

DT[, .(value = mean(value)), by = .(Weight, Height, Fitted, interval, limit)] %>%
  spread(key = "limit", value = "value")

Both gets you

#   Weight Height  Fitted interval      lwr      upr
#1:     42  153.4 51.0792     conf 54.57731 53.00376
#2:     42  153.4 51.0792     pred 61.41061 69.33717

Thanks, Actually I was talking about `dplyr` and `tidyr`. I had already solved the problem with `reshape2` but I want to know how to do it with those specific package. Thanks anyway! — TheRimalaya, Feb 05 '16 at 23:51

Spread vs dcast

2 Answers2

Linked

Related