2

I have a dataframe with sales. I need to aggregate the df by 2 columns ProductID and Day and sum the values of each aggregated group from a different column Amount so that it now shows the total. I wish to keep the other columns that can be grouped as well (same values across rows), in this case just Product. The final column Store won't be kept, since values can vary within grouped rows. However, I need to add a column UniqueStores, which counts the amount of unique stores for each group of same ProductID and Day. For example, the first group with ID=1 and Day= Monday would have 1 unique store "N", so value would be 1.

I tried drafting the table here in text but I couldn't format it correctly, so here as an image of how it looks before aggregating:

Table view

I've tried aggregating with both group_by + summarise and df[,sum,by] but they don't keep the variables that aren't given as indexes. Is there a workaround without having to manually insert every column that shall remain?

Final View

Thanks in advance and I hope I made myself clear.

Input values:

df <- data.frame("ProductID" = c(1,1,1,1,2,2,2,2), "Day"=c("Monday","Monday", "Tuesday", "Tuesday","Wednesday", "Wednesday", "Friday", "Friday"), "Amount"=c(5,5,3,7,6,9,5,2), "Product"=c("Food","Food","Food","Food","Toys","Toys","Toys","Toys"), "Store"=c("N","N","W","N", "S","W", "S","S"))

Henrik
  • 65,555
  • 14
  • 143
  • 159
Aarón Gzz
  • 99
  • 6

2 Answers2

3

We can do a group by operation in dplyr and summarise with the sum of 'Amount' and n_distinct (number of distinct elements of 'Store')

library(dplyr)
df %>% 
  group_by(ProductID, Day, Product) %>%
  summarise(Amount = sum(Amount), 
       UniqueStores = n_distinct(Store), .groups = 'drop')
# A tibble: 4 x 5
#  ProductID Day       Product Amount UniqueStores
#      <dbl> <chr>     <chr>    <dbl>        <int>
#1         1 Monday    Food        10            1
#2         1 Tuesday   Food        10            2
#3         2 Friday    Toys         7            1
#4         2 Wednesday Toys        15            2

If there are multiple columns, and want to subset only a part of the columns, while keeping the rest, an option is to mutate in the dataset and then use distinct to get the first row

df %>% 
  group_by(ProductID, Day, Product) %>%
  mutate(Amount = sum(Amount), 
       UniqueStores = n_distinct(Store), .keep = 'all') %>%
  ungroup %>%
  distinct(ProductID, Day, Product, .keep_all = TRUE)
akrun
  • 874,273
  • 37
  • 540
  • 662
1

In data.table:

library(data.table)

setDT(df)[, .(Amount = sum(Amount, na.rm = TRUE),
              UniqueStores = uniqueN(Store, na.rm = TRUE)), 
          by = .(ProductID, Day, Product)
          ]

Output:

   ProductID       Day Product Amount UniqueStores
1:         1    Monday    Food     10            1
2:         1   Tuesday    Food     10            2
3:         2 Wednesday    Toys     15            2
4:         2    Friday    Toys      7            1
arg0naut91
  • 14,574
  • 2
  • 17
  • 38
  • what is the purpose of the period after by= ? – Aarón Gzz Sep 16 '20 at 14:39
  • I used your solution and but want to use a vector with the column names as argument for by=() instead of writing them all in. Any idea how I could implement that? I'm getting an error regarding the length if I just do by = .(vector_colnames), as if one of the variables had a different amount of rows. – Aarón Gzz Sep 16 '20 at 18:08
  • Just put the name of vector after = – arg0naut91 Sep 16 '20 at 18:16
  • 1
    that didnt work :/ but I solved it using by=eval(vectorname). Thank you for your help! – Aarón Gzz Sep 16 '20 at 18:24