0

I know there are plenty of question specifically asking about melting or altering dataframes but after a bit of research I could not find any that directly related to my question.

Each month, our company uploads a data file that has sales data from the previous 24 months. So in the last three uploads, each file had a data point for 07/2017. The problem is, sometimes those data points differ even though they should theoretically be exactly the same.

The current table that I have merges the last three uploads by product brand as shown below. The column names are the actual months within the data frame and the rows have three repeats of each brand to account for each of the last three uploads.

I want to better compare the data and change the table (I don't know if I need to melt, mutate, transpose, etc.). Essentially, I need the final table to show each brand for each month (this will actually end up being 26 I think because across three uploads there will be an additional month on the back and on the front of the first and third upload). Along with their data and ideally the standard deviation across uploads.

Any tips or suggestions would be very much appreciated.

> current_table
    Brand 04/2017 05/2017 06/2017 07/2017
     x       10      11      9       7
     x       11      11      8       9
     x       17      13      8       9
     y       10      12      4       9 
     y       14      12      6       9
     y       15      18      8       9 

> needed_table
    Brand Upload1 Upload2 Upload3   SD   Month
     x       10      11      17    3.79  04/2017 
     x       11      11      13    1.15  05/2017
     x        9       8       8     .58  06/2017
     x        7       9       9    1.15  07/2017
     y       10      14      15    2.65  04/2017
     y       12      12      18    3.46  05/2017
     y        4       6       8    2.00  06/2017
     y        9       9       9    0.00  07/2017

> dput(head(final_merged))
structure(list(Brand = structure(1:6, .Label = c("BrandA", "BrandB", 
"BrandC", "BrandD", "BrandE", "BrandF", "BrandG", 
"BrandH", "BrandI", "BrandJ"), class = "factor"), 
    `2016-05-01` = c(542600.938987453, 121369.200906975, 194807.339528314, 
    26130.0960930597, 47328.5580057147, 878650.440203509), `2016-06-01` = c(427613.257159588, 
    154315.362993513, 293030.414270183, 35400.7075418873, 61389.4811620766, 
    1052195.59004308), `2016-07-01` = c(256692.139087568, 105309.022725613, 
    251793.570894072, 23245.2160336776, 45525.4213585196, 820502.422580008
    ), `2016-08-01` = c(473379.365918433, 100168.559639524, 186741.222637413, 
    31521.7915518086, 55076.6485031421, 1041982.37829743), `2016-09-01` = c(490091.694067029, 
    131598.231274657, 318420.372833846, 24057.2447362826, 56038.1064247602, 
    993577.092222938), `2016-10-01` = c(406522.567301472, 128957.624222936, 
    261565.423872617, 33422.3089757409, 57736.9698688817, 678066.590110671
    ), `2016-11-01` = c(477084.744242994, 133275.014021741, 284441.459082257, 
    30410.5664089452, 56822.1141268123, 1064886.69561047), `2016-12-01` = c(521829.584534476, 
    97983.5318536767, 208008.005137491, 31289.318, 52479.0371167326, 
    843310.325681422), `2017-01-01` = c(273501.342043654, 80012.2533117421, 
    195121.917662579, 22666.14, 54259.0136360474, 668189.841756365
    ), `2017-02-01` = c(685662.788776875, 127218.360798343, 242264.578372733, 
    25834.168, 53822.8322244458, 838142.3930596), `2017-03-01` = c(391139.806441577, 
    165270.729667372, 344475.891325452, 24324.698, 51447.0694269768, 
    1163066.49036161), `2017-04-01` = c(513219.288877507, 117680.874763073, 
    225660.11443535, 25882.934, 50427.8355378495, 705879.756352247
    ), `2017-05-01` = c(588841.848447216, 85962.4107041087, 205370.047642541, 
    24528.452, 50265.0630719988, 1172193.61025203), `2017-06-01` = c(592141.397064366, 
    186725.961678379, 239468.615831021, 31486.9, 48270.5745992247, 
    966639.315719368), `2017-07-01` = c(401320.540832368, 100871.630859688, 
    262648.979237381, 25590.72, 56131.8936476386, 531673.227676397
    ), `2017-08-01` = c(641511.11756313, 136289.088301959, 270709.662423172, 
    32773.69, 69511.1772105425, 1202000.7904115), `2017-09-01` = c(526890.631147415, 
    139847.451118091, 222951.630747679, 27480.974, 60645.622041419, 
    775667.840682767), `2017-10-01` = c(643121.78985178, 180025.223477481, 
    284514.628237396, 39847.288, 74247.6270395629, 1116881.49653723
    ), `2017-11-01` = c(641964.06004378, 170480.722936376, 236539.486734265, 
    33211.102, 65337.5560467566, 928064.977296857), `2017-12-01` = c(701220.126734004, 
    147016.439282975, 262372.515457345, 19004.936, 63303.3621122444, 
    886702.202033658), `2018-01-01` = c(522948.466272953, 94834.4834012872, 
    185624.076471366, 27964.772, 54902.1470296058, 394395.594443823
    ), `2018-02-01` = c(655763.722465697, 149547.490883877, 161965.604862219, 
    25338.488, 50546.0076345216, 734257.598467733), `2018-03-01` = c(614499.617793727, 
    179165.564404258, 380965.227731128, 32299.332, 52258.5402998062, 
    1048116.95924033), `2018-04-01` = c(587274.401556706, 158130.345063145, 
    130088.683628109, 32235.478, 51718.8379910518, 503405.505894426
    ), `2018-05-01` = c(NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_), `2018-06-01` = c(NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_)), .Names = c("Brand", 
"2016-05-01", "2016-06-01", "2016-07-01", "2016-08-01", "2016-09-01", 
"2016-10-01", "2016-11-01", "2016-12-01", "2017-01-01", "2017-02-01", 
"2017-03-01", "2017-04-01", "2017-05-01", "2017-06-01", "2017-07-01", 
"2017-08-01", "2017-09-01", "2017-10-01", "2017-11-01", "2017-12-01", 
"2018-01-01", "2018-02-01", "2018-03-01", "2018-04-01", "2018-05-01", 
"2018-06-01"), row.names = c(NA, 6L), class = "data.frame")
>
  • 1
    Would you mind to post your example data rather with `dput()` or `dput(head())`? It is a pain for users intending to help you to cobble together data from such copy-pasted outputs. For help on this, read: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610 Cheers! – jay.sf Jul 24 '18 at 15:28
  • Hi there @jaySf. Is that what you meant? I just added the output. – Cameron Kormylo Jul 24 '18 at 15:38
  • The edit which included the result of `dput(head(final_merged)) ` is somewhat confusing as is does not to reproduce the contents of `current_table`. – Uwe Jul 24 '18 at 16:17

2 Answers2

2
 library(dplyr)
 df %>% group_by(Brand) %>% 
        mutate(row=case_when(row_number()==1 ~ "Upload1", row_number()==2 ~ "Upload2", 
                             row_number()==3 ~ "Upload3", TRUE ~ as.character(x))) %>%
        gather(Month, value, -Brand, -row) %>% spread(row, value) %>% ungroup() %>% 
        mutate(SD=apply(.[(3:5)],1,sd), Month = gsub('X','',gsub('\\.','/',Month)))

    # A tibble: 8 x 6
  Brand Month   Upload1 Upload2 Upload3    SD
  <chr> <chr>     <int>   <int>   <int> <dbl>
 1 x     04/2017      10      11      17 3.79 
 2 x     05/2017      11      11      13 1.15 
 3 x     06/2017       9       8       8 0.577
 4 x     07/2017       7       9       9 1.15 
 5 y     04/2017      10      14      15 2.65 
 6 y     05/2017      12      12      18 3.46 
 7 y     06/2017       4       6       8 2    
 8 y     07/2017       9       9       9 0 

Data

df <- read.table(text="
            Brand 04/2017 05/2017 06/2017 07/2017
             x       10      11      9       7
             x       11      11      8       9
             x       17      13      8       9
             y       10      12      4       9 
             y       14      12      6       9
             y       15      18      8       9
              ",header=T, stringsAsFactors = F)
A. Suliman
  • 12,923
  • 5
  • 24
  • 37
  • Thank you. I am getting an error though. `Error in mutate_impl(.data, dots) : Evaluation error: NA/NaN argument. In addition: Warning messages: 1: In 1:3 : numerical expression has 26 elements: only the first used 2: In 1:3 : numerical expression has 26 elements: only the first used > ` – Cameron Kormylo Jul 24 '18 at 15:51
  • Now I get `Error in loadNamespace(name) : there is no package called ‘utf8’` – Cameron Kormylo Jul 24 '18 at 16:01
2

In data.table syntax and using melt() and dcast() this would become:

library(data.table)
melt(setDT(current_table), id.vars = "Brand", variable.name = "Month")[
  , SD := round(sd(value), 2L), by = .(Brand, Month)][
    , dcast(.SD, Brand + Month + SD ~ rowid(Brand, Month, prefix = "Upload"))]

Note that SD is computed in long format. So, it is independent of the number of uploads.

   Brand   Month   SD Upload1 Upload2 Upload3
1:     x 04/2017 3.79      10      11      17
2:     x 05/2017 1.15      11      11      13
3:     x 06/2017 0.58       9       8       8
4:     x 07/2017 1.15       7       9       9
5:     y 04/2017 2.65      10      14      15
6:     y 05/2017 3.46      12      12      18
7:     y 06/2017 2.00       4       6       8
8:     y 07/2017 0.00       9       9       9

Data

library(data.table)
current_table <- fread(
  "Brand 04/2017 05/2017 06/2017 07/2017
     x       10      11      9       7
     x       11      11      8       9
     x       17      13      8       9
     y       10      12      4       9 
     y       14      12      6       9
     y       15      18      8       9 "
)
Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134