Collapse and summarize while maintaining most frequent character variable by group

Question

I have a data frame:

df <- data.frame(resource = c("gold", "gold", "gold", "silver", "silver", "gold", "silver", "bronze"), amount = c(500, 2000, 4, 8, 100, 2000, 3, 5), unit = c("g", "g", "kg", "ton", "kg", "g", "ton", "kg"), price = c(10, 10, 10000, 50000, 50, 10, 50000, 20))

I want to calculate the total amount of each resource in the most common unit.

My idea to approach this would be to:

calculate the total value of each resource
determine the most common unit of measurement and the corresponding price
divide the total value by this price while maintaining a variable that indicates the measurement unit and ideally the price as well.

The result should look like

resource value  price unit amount
bronze   100    20     kg  100  
gold     85000  10     g   8500
silver   555000 50000  ton 11.1

If two measurement units are equally frequent it can be random, which one is used.

What are you really trying to do? As your data scales, the number of columns in the output will grow. And there's no obvious order mapping rows to columns. Perhaps you would rather have your original table ordered by resource? — PeterK, Jul 18 '22 at 13:23
I see, I guess what I am actually trying to do is calculate the amount for each resource in a single unit. My dataset has a big number of different units and resources so I thought this might be the best way. Thinking about it, maybe it would be better to calculate the total value per resource and then divide it by the price that is most common (i.e. the unit most often used). Changing one unit into another is difficult, because the dataset includes measurement units like carat, barrels etc. I understand now that my first approach was not the best idea. — Anton, Jul 18 '22 at 13:37

score 1 · Answer 1 · answered Jul 18 '22 at 13:30

I doubt that the format you desire is really helpful (as PeterK pointed out in the comments), but here we go:

df <- data.frame(resource = c("gold", "gold", "gold", "silver", "silver", "gold", "silver", "bronze"), amount = c(500, 2000, 4, 8, 100, 2000, 3, 5), unit = c("g", "g", "kg", "ton", "kg", "g", "ton", "kg"), price = c(10, 10, 10000, 50000, 50, 10, 50000, 20))

# calculate total value
DT <- setDT(df)[, .(value = sum(amount * price)), by = resource]

# create wide data
#  variables we want to cast wide
cols <- c("amount", "unit")
#  cast to wide
DT.wide <- dcast(setDT(df), resource ~ rowid(resource), value.var = cols)
new_colorder <- CJ(unique(rowid(df$resource)), cols, sorted = FALSE)[, paste(cols, V1, sep = "_")]
#  reorder the relevant columns
setcolorder(DT.wide, c(setdiff(names(DT.wide), new_colorder), new_colorder))

# join together
DT[DT.wide, on = .(resource)]

#    resource  value amount_1 unit_1 amount_2 unit_2 amount_3 unit_3 amount_4 unit_4
# 1:   bronze    100        5     kg       NA   <NA>       NA   <NA>       NA   <NA>
# 2:     gold  85000      500      g     2000      g        4     kg     2000      g
# 3:   silver 555000        8    ton      100     kg        3    ton       NA   <NA>

score 0 · Answer 2 · answered Jul 18 '22 at 14:51

I found an answer to my edited question following this thread Return most frequent string value for each group and modifying the code to fit my needs:

df %>% group_by(resource) %>% summarize (value = sum(price * amount), freq_unit =names(which.max(table(unit))), freq_price = as.numeric(names(which.max(table(price)))), total_amount = value/freq_price)

gives me

 resource  value freq_unit freq_price total_amount
  <chr>     <dbl> <chr>          <dbl>        <dbl>
1 bronze      100 kg                20          5  
2 gold      85000 g                 10       8500  
3 silver   555000 ton            50000         11.1

Collapse and summarize while maintaining most frequent character variable by group

2 Answers2