Can I make dataframe that summarises/aggregates data from a much larger one?

Question

I've got a horrendously large dataframe (covering hundreds of days worth of data) that contains data of the following pattern:

df = data.frame(date = c('2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09'),
                category = c(UKS, USD, UKS, UKS, USD, USD, UKS, USD, UKZ, UKY),
                time = c(07:59:53, 08:00:03, 08:00:03, 08:00:03, 08:00:03, 08:00:04, 08:00:08, 08:00:11, 08:00:14, 08:00:15)
                quantity = c(0.001, 0.003, 0.018, 0.010, 0.043, 0.005, 0.023, 0.005, 0.001, 0.008) 
                cumvol = c(0.001, 0.004, 0.022, 0.032, 0.075, 0.080, 0.103, 0.108, 0.109, 0.117)
                type = c(TSV, OSN, TSS, TSV, TSS, TSS, OSN, TSV, OSN, TSS)

This dataframe cannot be changed, however what I would like to do is create a 'summary' dataframe from this one that sums together the total quantity for each category and type per day, as well as providing a total quantity on that day.

So using the above example:

For 2021-01-09 
Total Quantity = 0.117
Total UKS = 0.052
Total USD = 0.056
Total UKZ = 0.001
Total UKY = 0.008

Does anyone have any advice on how to achieve this for all the days I have data for?

Your sample data doesn't run - it's missing at least a few commas and a `)`. Also note that `data.table`(with a `.`) is an hyperefficient R package with it's own syntax that you don't seem to be using. If you're using base R, to avoid confusion I'd suggest calling your data a "data frame" not "datatable". Unless you are using `data.table`, in which case please use the `data.table` tag (with a `.`) and use code to create a `data.table` object, not a `data.frame`. — Gregor Thomas, Feb 09 '22 at 14:18

score 1 · Accepted Answer · answered Feb 09 '22 at 14:17

Your table requires some fixing, but once you do that, you can do this

library(data.table)
setDT(df)

df[, .(cat_total = sum(quantity)), by=.(date,category)] %>% 
  .[, date_total:=sum(cat_total), by=date] %>% 
  .[]

Output:

         date category cat_total date_total
1: 2021-01-09      UKS     0.052      0.117
2: 2021-01-09      USD     0.056      0.117
3: 2021-01-09      UKZ     0.001      0.117
4: 2021-01-09      UKY     0.008      0.117

benson23 · Answer 2 · 2022-02-09T14:51:45.097

Although your question title includes "datatable", your input is a data.frame. Therefore I'll contribute a tidyverse approach here.

library(tidyverse)

df %>% group_by(date, category) %>% 
  summarize(Total_quantity = sum(quantity), .groups = "drop") %>% 
  group_by(date) %>% 
  mutate(Total_date = sum(Total_quantity)) 

# A tibble: 4 x 4
# Groups:   date [1]
  date       category Total_quantity Date_total
  <chr>      <chr>             <dbl>      <dbl>
1 2021-01-09 UKS               0.052      0.117
2 2021-01-09 UKY               0.008      0.117
3 2021-01-09 UKZ               0.001      0.117
4 2021-01-09 USD               0.056      0.117

JebLuvsStats · Answer 3 · 2022-02-09T14:34:01.767

Here's one way you could do it using sqldf. I think the sql language is pretty easy to understand for beginners and gives you another generalized tool to utilize. Here the UNION is key to join the 'total' answer and the 'category' answer. BTW, I slightly modified your data.frame to work.

df = data.frame(date = c('2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09', '2021-01-09'),
                category = c('UKS', 'USD', 'UKS', 'UKS', 'USD', 'USD', 'UKS', 'USD', 'UKZ', 'UKY'),
                time = c('07:59:53', '08:00:03', '08:00:03', '08:00:03', '08:00:03', '08:00:04', '08:00:08', '08:00:11', '08:00:14', '08:00:15'),
                quantity = c(0.001, 0.003, 0.018, 0.010, 0.043, 0.005, 0.023, 0.005, 0.001, 0.008), 
                cumvol = c(0.001, 0.004, 0.022, 0.032, 0.075, 0.080, 0.103, 0.108, 0.109, 0.117),
                type = c('TSV', 'OSN', 'TSS', 'TSV', 'TSS', 'TSS', 'OSN', 'TSV', 'OSN', 'TSS'))

library('sqldf')

sqldf("select Date, 'ALL' as [Category], 
             sum(quantity) as [Quantity]
      from df
      group by Date
      UNION
      select Date, Category,
             sum(quantity) as [Quantity]
      from df
      group by Date, category
      order by sum(Quantity) desc")

OUTPUT:

        date Category Quantity
1 2021-01-09      ALL    0.117
2 2021-01-09      USD    0.056
3 2021-01-09      UKS    0.052
4 2021-01-09      UKY    0.008
5 2021-01-09      UKZ    0.001

Can I make dataframe that summarises/aggregates data from a much larger one?

3 Answers3