How to aggregating data conditional on other criteria

Question

I have a dataframe called sales.

I would like to sum each month of sales by the common identifier, ID.

What I am finding hard, is how to do this conditional on there being a return value.

For example, when I sum sales_Feb1980 for ID 2, only the observations with returns should be included and the ones without should be ignored?

What the data looks like.. I have many more rows and columns.

   id    name  Returns_jan1980 Returns_feb1980 Sales_Jan1980 Sales_Feb1980   
  2b    LA         4.7%            5.6%             100              100
  2b    LA         5.7%                             250              100
  2b    LA         6.3%            5.8%             50               100
  5a    NY                         5.6%             30               200
  5a    NY         5.7%            3.6%             50               100

Correct output

  id   name Sales_Jan1980 Sales_Jan19080   
  2       LA     400          200        
  5       NY     50           300

Any help is appreciated.

score 0 · Answer 1 · answered Jun 17 '19 at 06:03

Using base R, we can first find out "Returns" and "Sales" columns and assuming they are in correct order we can turn the empty "Return" columns to 0 in their corresponding "Sales" column and then use aggregate to group by id and Name and take sum.

return_cols <- grep("^Returns", names(df))
sales_cols <- grep("^Sales", names(df))    

df[sales_cols][df[return_cols] == ""] <- 0
aggregate(cbind(Sales_Jan1980, Sales_Feb1980)~id + name, df, sum)

#  id name Sales_Jan1980 Sales_Feb1980
#1 2b   LA           400           200
#2 5a   NY            50           300

score 0 · Answer 2 · answered Jun 17 '19 at 06:06

One dplyr and tidyr possibility could be:

df %>%
 rowid_to_column() %>%
 gather(var, val, -c(id, rowid, name)) %>%
 arrange(rowid, sub(".*_", "", var)) %>%
 group_by(rowid) %>%
 mutate(val = ifelse(is.na(lag(val, default = first(val))), NA, val)) %>%
 ungroup() %>%
 spread(var, val) %>%
 group_by(id, name) %>%
 summarise_at(vars(starts_with("Sales")), sum, na.rm = TRUE)

  id    name  Sales_Feb1980 Sales_Jan1980
  <chr> <chr>         <dbl>         <dbl>
1 2b    LA              200           400
2 5a    NY              300            50

It sets to NA the values in Sales columns based on values in Returns columns and then performs the sum.

How to aggregating data conditional on other criteria

2 Answers2