
Here's a data frame:

sample_data <- data.frame(
  Device = c("Desktop", "Desktop", "Desktop", "Tablet", "Tablet", "Tablet", "Mobile", "Mobile", "Mobile"),
  Date_Range = seq(as.Date("2017-01-01"), as.Date("2017-01-30"), by = 1),
  Sessions = ceiling(rnorm(90, mean = 3000, sd = 300)),
  Conversion1 = ceiling(rnorm(90, mean = 800, sd = 100)),
  Conversion2 = ceiling(rnorm(90, mean = 400, sd = 50))
)

And it looks like this:

head(sample_data)
   Device Date_Range Sessions Conversion1 Conversion2
1 Desktop 2017-01-01     2906         920         361
2 Desktop 2017-01-02     2604         873         380
3 Desktop 2017-01-03     2449         781         398
4  Tablet 2017-01-04     3205         851         534
5  Tablet 2017-01-05     3099         622         341
6  Tablet 2017-01-06     2708         842         386

If there were only one metric (there are actually three: Sessions plus the two conversion columns) and I wanted the dates as columns for a trended table, I could do:

sample_data_spread <- tbl_df(sample_data) %>%
  group_by(Date_Range) %>%
  summarize(Sessions = sum(Sessions)) %>%
  spread(key = Date_Range, value = Sessions)

This looks like:

> sample_data_spread[,1:5]
# A tibble: 1 × 5
  `2017-01-01` `2017-01-02` `2017-01-03` `2017-01-04` `2017-01-05`
         <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
1         9238         8937         9129         9384         8222

But what I actually want is a table with the following format:

  • columns: the dates, as I have now
  • rows: Sessions, Conversion1, Conversion2
  • I'm ignoring the Device column for now

What is a good way to do this?

Doug Fir
  • But Conversion1, Conversion2 have different values for each row. Perhaps `tbl_df(sample_data) %>% group_by(Date_Range) %>% mutate(Sessions = sum(Sessions)) %>% unique() %>% spread(Date_Range, Sessions)` – akrun Jan 31 '17 at 13:05
  • @akrun I'm seeking the sum(metric) for each date. So a row for Sessions, a row for Conversion1 and a row for Conversion2 – Doug Fir Jan 31 '17 at 13:07
  • So, what will be the value for Conversion1 and Conversion2 as it is completely different for each group of date – akrun Jan 31 '17 at 13:07
  • This sounds like a reshape wide, maybe [this post](http://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format) will be helpful. – lmo Jan 31 '17 at 13:07

2 Answers

4

We can do this with melt/dcast from data.table

library(data.table)
dcast(melt(setDT(sample_data), id.var = c("Device", "Date_Range")),
      variable ~ Date_Range, value.var = "value", fun.aggregate = sum)
#   variable 2017-01-01 2017-01-02 2017-01-03 2017-01-04 2017-01-05 2017-01-06 2017-01-07 2017-01-08 2017-01-09 2017-01-10 2017-01-11 2017-01-12
#1:    Sessions       9010       9586       8855       9160       8662       9509       8041       8106       9840       8618       8485       8417
#2: Conversion1       2802       2339       2348       2548       2570       2681       2536       2207       2479       2271       2417       2424
#3: Conversion2       1131       1300       1198       1187       1293       1185       1219       1400       1213       1080       1230       1119
#   2017-01-13 2017-01-14 2017-01-15 2017-01-16 2017-01-17 2017-01-18 2017-01-19 2017-01-20 2017-01-21 2017-01-22 2017-01-23 2017-01-24 2017-01-25
#1:       8656       8985       9674       8635       8348       9044       9516       9993       9230       8698       8773      10004       8274
#2:       2505       2527       2115       2530       2437       2461       2234       2614       2384       2306       2560       2559       2431
#3:       1246       1249       1221       1172       1236       1147       1297       1158       1173       1178       1060       1175       1210
#   2017-01-26 2017-01-27 2017-01-28 2017-01-29 2017-01-30
#1:       8759       9135       8287       8928       9020
#2:       2209       2427       2343       2380       2305
#3:       1265       1226       1175       1171       1122
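
If the Device split is wanted later (the question ignores it for now), the same melt/dcast call should extend by adding Device to the casting formula; a sketch of that variant, not part of the original answer and untested against the random data above:

library(data.table)
# one row per Device/metric combination, dates as columns
dcast(melt(setDT(sample_data), id.var = c("Device", "Date_Range")),
      Device + variable ~ Date_Range, value.var = "value", fun.aggregate = sum)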

Or using the tidyverse

library(dplyr)
library(tidyr)
sample_data %>% 
      group_by(Date_Range) %>%
      summarise_each(funs(sum), - Device) %>% 
      gather(Var, Val, - Date_Range) %>%
      group_by(Date_Range) %>% 
      mutate(ind = row_number()) %>% 
      spread(Date_Range, Val)
# A tibble: 3 × 32
#           Var   ind `2017-01-01` `2017-01-02` `2017-01-03` `2017-01-04` `2017-01-05` `2017-01-06` `2017-01-07` `2017-01-08` `2017-01-09` `2017-01-10`
#*       <chr> <int>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
#1 Conversion1     2         2802         2339         2348         2548         2570         2681         2536         2207         2479         2271
#2 Conversion2     3         1131         1300         1198         1187         1293         1185         1219         1400         1213         1080
#3    Sessions     1         9010         9586         8855         9160         8662         9509         8041         8106         9840         8618
# ... with 20 more variables: `2017-01-11` <dbl>, `2017-01-12` <dbl>, `2017-01-13` <dbl>, `2017-01-14` <dbl>, `2017-01-15` <dbl>, `2017-01-16` <dbl>,
#   `2017-01-17` <dbl>, `2017-01-18` <dbl>, `2017-01-19` <dbl>, `2017-01-20` <dbl>, `2017-01-21` <dbl>, `2017-01-22` <dbl>, `2017-01-23` <dbl>,
#   `2017-01-24` <dbl>, `2017-01-25` <dbl>, `2017-01-26` <dbl>, `2017-01-27` <dbl>, `2017-01-28` <dbl>, `2017-01-29` <dbl>, `2017-01-30` <dbl>
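
For reference, summarise_each()/funs() and gather()/spread() have since been superseded in the tidyverse. With tidyr >= 1.0.0 and dplyr >= 1.0.0 the same table could be built roughly as below; this is a sketch added for context, not part of the original answer:

library(dplyr)
library(tidyr)

sample_data %>%
  # long form: one row per Date_Range x metric
  pivot_longer(c(Sessions, Conversion1, Conversion2),
               names_to = "Var", values_to = "Val") %>%
  # sum each metric within each date
  group_by(Var, Date_Range) %>%
  summarise(Val = sum(Val), .groups = "drop") %>%
  # wide form: dates become columns, one row per metric
  pivot_wider(names_from = Date_Range, values_from = Val)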
akrun
  • Thanks for answering but that's not giving me what I want. It would be something like (editing your last line in an imaginary world): spread(Date_Range, c(Sessions, Conversion1, Conversion2)) So for example date 1st January 2017 would have 3 rows. One for the sum total of Sessions on that date, one for the sum total of Conversion1 and one for the sum total of Conversion2 on that date – Doug Fir Jan 31 '17 at 13:11
  • @DougFir I am still not sure what your expected output will be as there are multiple Conversion for each Date_Range – akrun Jan 31 '17 at 13:12
  • If it was an Excel pivot table I would have date range as columns and then drop in 3 values, one for each of the metric – Doug Fir Jan 31 '17 at 13:13
  • Sorry I don't think I communicated this one very well. Answer above did the trick – Doug Fir Jan 31 '17 at 13:17
  • Thanks @akrun. What is gather doing? I read ?gather but really don't follow from gather() onwards? Thanks for your help with this so far – Doug Fir Jan 31 '17 at 13:58
  • @DougFir gather is the equivalent of melt in reshape2, it turns data into a long format. – Haboryme Jan 31 '17 at 14:02
  • @Haboryme Thanks for answering the comment. – akrun Jan 31 '17 at 15:44
  • tidyverse method worked great for me, found it much simpler than using data.table::dcast – Rocco Jul 09 '18 at 10:09
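
To illustrate the comment above about gather(): it reshapes wide data into long key/value form, much like reshape2::melt(). A minimal sketch on the sample data, assuming dplyr and tidyr are loaded:

# each Sessions/Conversion1/Conversion2 value becomes its own row,
# with the former column name in Var and the number in Val
sample_data %>%
  gather(Var, Val, -Device, -Date_Range) %>%
  head()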
3

Maybe something like:

library(dplyr)
library(tidyr)

# for each metric, build a one-row table with dates as columns, then stack the rows
newdf <- t(mapply(function(x) tbl_df(sample_data) %>%
                    group_by(Date_Range) %>%
                    summarize_(x = paste0("sum(", x, ")")) %>%
                    spread(key = Date_Range, value = x),
                  list("Sessions", "Conversion1", "Conversion2")))

rownames(newdf) <- list("Sessions", "Conversion1", "Conversion2")
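
As an aside, summarize_() and the pasted "sum(...)" string are standard-evaluation idioms that dplyr has since deprecated; with current dplyr the .data pronoun gives the same per-metric summary without building code as text. A sketch under that assumption (one_metric_row is just an illustrative helper name):

library(dplyr)
library(tidyr)

one_metric_row <- function(x) {
  sample_data %>%
    group_by(Date_Range) %>%
    summarize(val = sum(.data[[x]])) %>%   # .data[[x]] replaces paste0("sum(", x, ")")
    spread(key = Date_Range, value = val)
}

newdf <- t(sapply(c("Sessions", "Conversion1", "Conversion2"), one_metric_row))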
Haboryme